Abstract
This project develops a model to forecast the bicycle rental demand in DC based on historical data, bike locations, and features of the weather. Three datasets from 2015 were utilized: 1) Capital Bikeshare ridership, containing information for individual bike rentals throughout the year, 2) Wunderground dataset for daily DC weather information, and 3) Google Maps API location data. Multiple Linear Regression, Decision Tree, and Random Forest models were tested. According to our primary model performance metric, Mean Absolute Error, the Random Forest model provided the most accurate predictions.
bike_q1 <- read.csv("2015-Q1-Trips-History-Data.csv")
bike_q2 <- read.csv("2015-Q2-Trips-History-Data.csv")
bike_q3 <- read.csv("2015-Q3-cabi-trip-history-data.csv")
bike_q4 <- read.csv("2015-Q4-Trips-History-Data.csv")
weather <- read.csv('weather_2015.csv')
bike <-read.csv("Capital_Bike_Share_Locations.csv",header=TRUE, sep=",")
\section{1. Introduction}
In the last decade there has been increasing concern regarding the environment and quality of life, especially in big cities. From increased taxation to financial incentives, different approaches and public policies have been proposed and tested around the world to address these concerns. In this scenario, shared cars and shared bicycles have become popular solutions in many cities to help mitigate traffic and environmental impact. How can these programs be set up for success?
\section{Research Question}
Due to the increasing importance and popularity of the Capital Bikeshare program, this project aims to: 1) identify the variables that most impact hourly ridership, and 2) develop a model to predict hourly bikeshare demand in the Greater Washington DC region based on historical ridership and weather data for 2015.
\section{2. Methods}
\section{2.1. Dataset}
To conduct the analysis, three datasets from 2015 were used: 1) Capital Bikeshare ridership, containing information for individual bike rentals throughout the year, 2) the Wunderground dataset of daily DC weather information, and 3) Google Maps API location data.
\section{2.2. Data Wrangling, Cleaning, and Feature Engineering}
The analysis began by loading the R packages and the raw datasets for trip history and weather.
With the raw datasets loaded, the first challenge was to combine the various datasets.
The trip history datasets were addressed first. We reviewed the variables available in each set and adjusted them as necessary to ensure consistency across the separate datasets.
names(bike_q1)
## [1] "Total.duration..ms." "Start.date" "Start.station"
## [4] "End.date" "End.station" "Bike.number"
## [7] "Subscription.Type"
names(bike_q2)
## [1] "Duration..ms." "Start.date" "Start.station"
## [4] "End.date" "End.station" "Bike.number"
## [7] "Subscription.type"
names(bike_q3)
## [1] "Duration..ms." "Start.date" "End.date"
## [4] "Start.station.number" "Start.station" "End.station.number"
## [7] "End.station" "Bike.." "Member.type"
names(bike_q4)
## [1] "Duration..ms." "Start.date" "End.date"
## [4] "Start.station.number" "Start.station" "End.station.number"
## [7] "End.station" "Bike.." "Member.type"
bike_q3$Start.station.number <- NULL
bike_q3$End.station.number <- NULL
bike_q4$Start.station.number <- NULL
bike_q4$End.station.number <- NULL
# Q2: only the duration and subscription columns differ from Q1
names(bike_q2)[c(1, 7)] <- c("Total.duration..ms.", "Subscription.Type")
# Q3 and Q4: rename the duration, bike number, and member type columns
names(bike_q3)[c(1, 6, 7)] <- c("Total.duration..ms.", "Bike.number", "Subscription.Type")
names(bike_q4)[c(1, 6, 7)] <- c("Total.duration..ms.", "Bike.number", "Subscription.Type")
names(bike_q1)
## [1] "Total.duration..ms." "Start.date" "Start.station"
## [4] "End.date" "End.station" "Bike.number"
## [7] "Subscription.Type"
names(bike_q2)
## [1] "Total.duration..ms." "Start.date" "Start.station"
## [4] "End.date" "End.station" "Bike.number"
## [7] "Subscription.Type"
names(bike_q3)
## [1] "Total.duration..ms." "Start.date" "End.date"
## [4] "Start.station" "End.station" "Bike.number"
## [7] "Subscription.Type"
names(bike_q4)
## [1] "Total.duration..ms." "Start.date" "End.date"
## [4] "Start.station" "End.station" "Bike.number"
## [7] "Subscription.Type"
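The name check above can also be done programmatically. The following is a minimal sketch on toy data, not the code used in this project: the two small data frames and the `rename_map` lookup are hypothetical stand-ins, with column names copied from the output above.

```r
# Toy stand-ins for two quarterly files (values are made up)
q1 <- data.frame(Total.duration..ms. = 1000,
                 Start.date = "1/1/2015 0:02",
                 Subscription.Type = "Registered")
q3 <- data.frame(Duration..ms. = 2000,
                 Start.date = "7/1/2015 0:05",
                 Member.type = "Casual")

# Columns that appear in one frame but not in the other
only_in_q1 <- setdiff(names(q1), names(q3))
only_in_q3 <- setdiff(names(q3), names(q1))

# Rename by name instead of by position (hypothetical lookup table)
rename_map <- c(Duration..ms. = "Total.duration..ms.",
                Member.type   = "Subscription.Type")
hits <- names(q3) %in% names(rename_map)
names(q3)[hits] <- unname(rename_map[names(q3)[hits]])

# TRUE once both frames expose the same set of column names
same_columns <- setequal(names(q1), names(q3))
```

Renaming by name rather than by index keeps the step valid even if a later file arrives with its columns in a different order.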
With variable consistency across the individual datasets, we were able to use row bind to combine Q1-Q4 of trip history data.
bike_df <- rbind(bike_q1, bike_q2)
bike_df <- rbind(bike_df, bike_q3)
bike_df <- rbind(bike_df, bike_q4)
dim(bike_df)
## [1] 3192908 7
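Note that Q3 and Q4 list their columns in a different order from Q1 and Q2. This is harmless here because `rbind` on data frames matches columns by name, not by position. A small sketch on made-up rows (the values are illustrative only):

```r
# Two frames with the same column names in a different order
a <- data.frame(Start.date = "1/1/2015 0:02",
                End.date = "1/1/2015 0:42",
                Start.station = "A",
                stringsAsFactors = FALSE)
b <- data.frame(Start.date = "7/1/2015 0:05",
                Start.station = "B",
                End.date = "7/1/2015 0:20",
                stringsAsFactors = FALSE)

# rbind aligns b's columns to a's by name; the result keeps a's column order
combined <- rbind(a, b)

# Several frames can also be stacked in one call
combined_again <- do.call(rbind, list(a, b))
```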
Next, we focused on combining the full 2015 ridership dataset with the full 2015 weather dataset. To do this, we had to identify a common variable between the datasets. We chose to use “date” in the ymd format. With the common variable in place in both datasets, we merged the ridership and weather datasets on the date column.
bike_df$Start.date<-mdy_hm(bike_df$Start.date)
bike_df$End.date<-mdy_hm(bike_df$End.date)
str(bike_df$Start.date)
## POSIXct[1:3192908], format: "2015-01-01 00:02:00" "2015-01-01 00:02:00" ...
str(bike_df$End.date)
## POSIXct[1:3192908], format: "2015-01-01 00:42:00" "2015-01-01 00:42:00" ...
bike_df$date<-date(bike_df$Start.date)
str(bike_df$date)
## Date[1:3192908], format: "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" ...
str(weather$EST)
## Factor w/ 365 levels "2015-1-1","2015-1-10",..: 1 12 23 26 27 28 29 30 31 2 ...
weather$EST<-ymd(weather$EST)
str(weather$EST)
## Date[1:365], format: "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
names(weather)[1]<-paste("date")
str(weather$date)
## Date[1:365], format: "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
bike_weather <- merge(bike_df,weather,by="date")
dim(bike_weather)
## [1] 3192908 30
names(bike_weather)
## [1] "date" "Total.duration..ms."
## [3] "Start.date" "Start.station"
## [5] "End.date" "End.station"
## [7] "Bike.number" "Subscription.Type"
## [9] "Max.TemperatureF" "Mean.TemperatureF"
## [11] "Min.TemperatureF" "Max.Dew.PointF"
## [13] "MeanDew.PointF" "Min.DewpointF"
## [15] "Max.Humidity" "Mean.Humidity"
## [17] "Min.Humidity" "Max.Sea.Level.PressureIn"
## [19] "Mean.Sea.Level.PressureIn" "Min.Sea.Level.PressureIn"
## [21] "Max.VisibilityMiles" "Mean.VisibilityMiles"
## [23] "Min.VisibilityMiles" "Max.Wind.SpeedMPH"
## [25] "Mean.Wind.SpeedMPH" "Max.Gust.SpeedMPH"
## [27] "PrecipitationIn" "CloudCover"
## [29] "Events" "WindDirDegrees"
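Because `merge` performs an inner join by default, it is worth confirming that no trips were dropped for lack of a matching weather row; the unchanged row count of 3192908 above suggests none were. A minimal sketch of that check on toy data (the trip and temperature values are made up, and base `as.Date` stands in for the lubridate parsers used above):

```r
# Toy trip-level data: several rides can share one calendar date
trips <- data.frame(
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02")),
  Bike.number = c("W00001", "W00002", "W00001")
)

# Toy daily weather: one row per date, keys written without zero padding
wx <- data.frame(
  date = as.Date(c("2015-1-1", "2015-1-2")),
  Mean.TemperatureF = c(33, 39)
)

# Inner join on the shared key; each trip picks up that day's weather
merged <- merge(trips, wx, by = "date")

# The join lost no trips and left no weather value missing
no_rows_lost <- nrow(merged) == nrow(trips)
no_missing_weather <- !anyNA(merged$Mean.TemperatureF)
```

Passing `all.x = TRUE` to `merge` would instead keep unmatched trips and fill their weather columns with `NA`.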
While the ridership dataset provided some geographic information, we wanted a more robust set of geographic variables, such as city and zip code. Therefore, we used revgeocode to extract additional geographic variables from the Google Maps API, using the latitude/longitude of each station in the Capital Bikeshare locations file. The additional variables were combined with the merged ridership and weather dataset using row bind.
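The lookup pattern (one request per station, results stacked with row bind) can be sketched offline. In the sketch below, `fake_revgeocode` is a hypothetical stand-in for a real reverse geocoder so that no API call is made, and the `stations` table and its column names are assumptions for illustration.

```r
# Hypothetical station table; column names are assumptions
stations <- data.frame(
  name = c("Station A", "Station B"),
  latitude = c(38.858971, 38.85725),
  longitude = c(-77.05323, -77.05332)
)

# Offline stand-in for a reverse geocoder: takes c(lon, lat), returns a label
fake_revgeocode <- function(lonlat) {
  sprintf("address near (%.4f, %.4f)", lonlat[2], lonlat[1])
}

# One lookup per station, stacked into a one-column character matrix
addresses <- do.call(rbind, lapply(seq_len(nrow(stations)), function(i) {
  lonlat <- unlist(stations[i, c("longitude", "latitude")], use.names = FALSE)
  fake_revgeocode(lonlat)
}))

# Attach the results to the station table
stations$address <- as.vector(addresses)
```

Selecting the coordinate columns by name, as here, avoids depending on their position in the file.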
# Import dataset
bike<-read.csv("Capital_Bike_Share_Locations.csv",header=TRUE, sep=",")
# Extract gps info for each bike location
gps <- bike[c(3,5,6)]
# Extract bike full address from gps info
ad <- do.call(rbind,lapply(1:nrow(gps),function(i)revgeocode(as.numeric(gps[i,3:2])))) # Extract full address
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.858971,-77.05323&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.85725,-77.05332&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.856425,-77.049232&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.86017,-77.049593&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.857866,-77.05949&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.862303,-77.059936&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.8637,-77.0633&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.857063,-77.051141&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.8629,-77.0528&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.848441,-77.051516&sensor=false
## ... (one request per station; intermediate output omitted)
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.92333,-77.0352&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.889988,-76.995193&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.912659,-77.017669&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.889908,-76.983326&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.897274,-76.994749&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.927095,-76.978924&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.897195,-76.983575&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.928644,-76.990955&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.843222,-76.999388&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.863897,-76.990037&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.903732,-76.987211&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.916787,-77.028139&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.852248,-77.105022&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.834108,-77.087323&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.880705,-77.08596&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.867262,-77.072315&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.887378,-77.001955&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.894851,-77.02324&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.917622,-77.01597&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.918155,-77.004746&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.892275,-77.013917&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.886978,-77.013769&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.887312,-77.025762&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.927497,-76.997194&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.862478,-77.086599&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.856319,-77.11153&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.871822,-77.107906&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.985404,-77.023082&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.981103,-77.097426&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.983838,-77.09221&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.096312,-77.192672&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.093783,-77.202501&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.988562,-77.096539&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.084125,-77.151291&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.98954,-77.098029&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.094772,-77.145213&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.98128,-77.011336&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.983627,-77.006311&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.99521,-77.02918&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.983525,-77.095367&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.961763,-77.085998&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.921074,-77.031887&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.87501,-77.0024&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.920387,-77.025672&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.876737,-76.994468&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.120045,-77.156985&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.099376,-77.188014&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.082779,-77.148827&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.123513,-77.15741&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.990249,-77.02935&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.107709,-77.152072&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.982456,-77.091991&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.000578,-77.00149&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.095661,-77.159048&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.989724,-77.023854&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.975,-77.01121&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.977933,-77.006472&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.992375,-77.100104&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.990639,-77.100239&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.977093,-77.094589&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.102099,-77.200322&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.094103,-77.132954&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.076331,-77.141378&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.102212,-77.177091&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.992679,-77.029457&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.999388,-77.031555&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.997033,-77.025608&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.114688,-77.171487&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.110314,-77.182669&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.923583,-77.050046&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.90706,-77.015231&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.898536,-76.931862&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.908473,-76.933099&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.878433,-77.03023&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.873755,-77.089233&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.86646,-77.04826&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.999634,-77.109647&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.96115,-77.088659&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.899703,-77.008911&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.837666,-77.09482&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.873219,-77.082104&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.869418,-77.095596&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.941154,-77.062036&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.103091,-77.196442&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.097636,-77.196636&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.997445,-77.023894&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.085394,-77.145803&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.900358,-77.012108&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.952369,-77.002721&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.975219,-77.016855&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.920682,-76.995876&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.866471,-77.076131&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.839912,-77.087083&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.119765,-77.166093&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.964992,-77.103381&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.084379,-77.146866&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.984691,-77.094537&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.88992,-77.071301&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.903295,-77.065884&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.804378,-77.060866&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.894941,-77.09169&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.869442,-77.104503&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.898404,-77.024281&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.897612,-77.080851&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.901755,-77.051084&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.801111,-77.068952&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.82175,-77.047494&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.802677,-77.063562&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.820064,-77.057619&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.82595,-77.058541&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.820932,-77.053096&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.833077,-77.059821&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.890612,-77.084801&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.986743,-77.000035&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.864702,-77.048672&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.96497,-77.075946&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.90366,-77.034846&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.89841,-77.039624&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.997653,-77.034499&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.912648,-77.041834&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.859254,-77.063275&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.908008,-76.996985&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.896456,-77.104562&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.898984,-77.078317&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.895377,-77.09713&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.898412,-77.043182&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.876528,-77.12712&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.862467,-77.068242&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.913761,-77.027025&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.86612,-77.08787&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.89967,-77.003666&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.90864,-77.02277&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.961339,-77.027855&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.894474,-76.974828&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.896134,-76.9929&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.915604,-76.983683&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.946182,-77.08059&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.994113,-77.076986&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=39.105295,-77.194774&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.898301,-77.118009&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.812718,-77.044097&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.80704,-77.059817&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.813485,-77.049468&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.999378,-77.097882&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.884829,-77.127671&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.958267,-77.084636&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.928893,-77.03625&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.912652,-77.036278&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.999679,-77.051168&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.979875,-77.093522&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.903658,-77.031737&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.908643,-77.012365&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.831516,-77.008133&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.893511,-77.041544&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.909394,-77.048728&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.995681,-77.038721&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.799267,-77.0447&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.925284,-77.032375&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.822738,-77.049265&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.90843,-77.02714&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.884377,-77.025791&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.947156,-77.065115&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.894972,-77.003135&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.828437,-77.086031&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.884859,-77.155988&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.880992,-77.135271&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.812711,-77.061715&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.820058,-77.062821&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.890493,-77.017253&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.890544,-77.049379&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.888097,-77.038325&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.798133,-77.0487&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.843422,-77.064016&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.903598,-77.01397&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.841291,-77.063093&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.797557,-77.053766&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.818748,-77.047783&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.829545,-77.047844&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.928552,-77.032224&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.90268,-77.035737&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.949813,-77.080217&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.91263,-76.971923&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.897407,-76.925907&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.870695,-76.982359&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.929261,-77.240654&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.92403,-77.235955&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.923116,-77.232108&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.924437,-77.217664&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.931911,-77.219261&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.928919,-77.225394&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.932636,-77.231825&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.962524,-77.361902&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.962095,-77.358815&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.960574,-77.356324&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.955079,-77.351649&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.957037,-77.359718&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.884916,-77.005965&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.948363,-77.338119&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.959633,-77.358741&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.892441,-77.048947&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.891805,-76.913563&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.960084,-77.353414&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.955314,-77.368416&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?latlng=38.923083,-77.227417&sensor=false
# Correct row 177: drop the extra comma so this address splits into the same four parts as the others
ad<-as.data.frame(ad)
levels(ad[,1])[levels(ad[,1])=="Fowler Hall, 800 Florida Ave NE, Washington, DC 20002, USA"] <- "Fowler Hall 800 Florida Ave NE, Washington, DC 20002, USA"
# Transform into a data frame and split each address on commas: street, city, "state zip", country
ad1<-data.frame(do.call(rbind, str_split(ad[,1], ',')))
# Split the "state zip" field on spaces (the first piece is empty because the field starts with a space)
ad2<-data.frame(do.call(rbind, str_split(ad1[,3], ' ')))
## Warning in (function (..., deparse.level = 1) : number of columns of result
## is not a multiple of vector length (arg 127)
ad3 <- cbind(ad1[1],ad1[2],ad2[2], ad2[3], ad1[4]) # Join street, city, state, zip and country
colnames(ad3) <- c("Address", "City", "State", "Zip", "Country") # Rename the variables
gps2 <- cbind(gps,ad3) # Add address, city, state, zip and country to the gps data
colnames(gps2)[1] <- c("Start.station") # Rename the key to match the master dataset
# Left join: keep every trip, even when its start station has no location record
master_df <- merge(bike_weather,gps2,by="Start.station", all.x=TRUE)
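The `rbind` warning in the address-splitting step suggests that not every geocoded address has the same number of comma-separated parts, in which case values are recycled into the wrong columns. A minimal sketch of a more defensive split follows. It uses base R only, the helper name `parse_address` is hypothetical and not part of the original analysis, and the two addresses are illustrative.

```r
# Hypothetical helper: split "street, city, ST zip, country" into named fields.
# Addresses with an unexpected number of parts yield NA rather than recycled values.
parse_address <- function(x) {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  if (length(parts) != 4) {
    return(c(Address = NA_character_, City = NA_character_,
             State = NA_character_, Zip = NA_character_,
             Country = NA_character_))
  }
  state_zip <- strsplit(parts[3], " ", fixed = TRUE)[[1]]
  c(Address = parts[1], City = parts[2],
    State = state_zip[1], Zip = state_zip[2], Country = parts[4])
}

# Illustrative input only; the second address has an extra comma.
addresses <- c(
  "800 Florida Ave NE, Washington, DC 20002, USA",
  "Fowler Hall, 800 Florida Ave NE, Washington, DC 20002, USA"
)
parsed <- as.data.frame(t(sapply(addresses, parse_address, USE.NAMES = FALSE)),
                        stringsAsFactors = FALSE)
print(parsed)
```

Rows that come back as NA can then be corrected by hand, instead of being silently misaligned.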
After the merge, a few stations still had missing address data, so we spot-checked those observations and filled in the correct city by hand.
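One way to find such observations is to list the start stations whose city is still missing after the merge. The sketch below runs on a toy data frame. Its rows are invented for illustration, and only the column names follow `master_df`.

```r
# Toy stand-in for master_df; the values are illustrative only.
toy_df <- data.frame(
  Start.station = c("King St Metro", "King St Metro", "Lincoln Memorial"),
  City = c(NA, NA, "Washington"),
  stringsAsFactors = FALSE
)

# Stations with no city after the merge, and how many trips each one affects.
missing_city <- table(toy_df$Start.station[is.na(toy_df$City)])
print(missing_city)
```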
master_df$City[master_df$Start.station=="Utah St & 11th St N "]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Utah St & 11th
## St N ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Veterans Pl & Pershing Dr "]<-"Silver Spring"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Veterans Pl &
## Pershing Dr ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Washington Blvd & Walter Reed Dr "]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Washington
## Blvd & Walter Reed Dr ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="8th & F St NW"]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "8th & F St
## NW", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Lee Hwy & N Nelson St"]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Lee Hwy & N
## Nelson St", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="11th & K St NW"]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "11th & K St
## NW", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="20th & Bell St"]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "20th & Bell
## St", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="23rd & E St NW "]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "23rd & E St NW
## ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="34th & Water St NW"]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "34th & Water
## St NW", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="34th St & Minnesota Ave SE"]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "34th St &
## Minnesota Ave SE", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Anacostia Ave & Benning Rd NE / River Terrace "]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Anacostia Ave
## & Benning Rd NE / River Terrace ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Court House Metro / 15th & N Uhle St "]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Court House
## Metro / 15th & N Uhle St ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Fenton St & Ellsworth Dr "]<-"Silver Spring"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Fenton St &
## Ellsworth Dr ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="King St Metro"]<-"Alexandria"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "King St
## Metro", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Lincoln Park / 13th & East Capitol St NE "]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Lincoln Park /
## 13th & East Capitol St NE ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Montgomery Ave & Waverly St "]<-"Bethesda"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Montgomery Ave
## & Waverly St ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="N Adams St & Lee Hwy"]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "N Adams St &
## Lee Hwy", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="N Quincy St & Wilson Blvd"]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "N Quincy St &
## Wilson Blvd", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="S Abingdon St & 36th St S"]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "S Abingdon St
## & 36th St S", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="N Nelson St & Lee Hwy"]<-"Arlington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "N Nelson St &
## Lee Hwy", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Fenton St & New York Ave "]<-"Silver Spring"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Fenton St &
## New York Ave ", : invalid factor level, NA generated
master_df$City[master_df$Start.station=="Alta Tech Office"]<-"Washington"
## Warning in `[<-.factor`(`*tmp*`, master_df$Start.station == "Alta Tech
## Office", : invalid factor level, NA generated
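The "invalid factor level, NA generated" warnings above mean that R could not store the new city name because it is not among the existing levels of the City factor, so the affected rows receive NA rather than the intended value. A minimal sketch of one way to avoid this, assuming a small stand-in data frame `toy_df` (hypothetical, not part of the original analysis), is to assign on a character copy of the column and convert back to a factor afterwards:

```r
# Toy data standing in for master_df (hypothetical values for illustration)
toy_df <- data.frame(
  Start.station = c("King St Metro", "Alta Tech Office", "10th & U St NW"),
  City = factor(c(NA, NA, " Washington"))
)

# Assigning a value that is not an existing level yields NA with a warning,
# so work on a character version of the column instead
toy_df$City <- as.character(toy_df$City)
toy_df$City[toy_df$Start.station == "King St Metro"] <- "Alexandria"
toy_df$City[toy_df$Start.station == "Alta Tech Office"] <- "Washington"

# Trim stray whitespace and restore the factor once all values are set
toy_df$City <- factor(trimws(toy_df$City))
levels(toy_df$City)
```

An alternative would be to add the missing names to `levels(master_df$City)` before assigning; which is preferable depends on how the column is used later.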
\section{Data Cleaning and Feature Engineering}
With all of the individual datasets combined into a single master dataframe, we moved on to the data cleaning stage.
The primary purpose of the data cleaning stage was to transform variable data types and address missing values so that the data would be easier to work with in the exploratory data analysis (EDA) and modeling stages.
The primary purpose of the feature engineering stage was to create new variables based on combinations or derivatives of existing variables. Our intuition was that these variables would have a large impact on the prediction models.
The following sequence of commentary and code describes the data cleaning and feature engineering that was conducted.
Changed format of date variables using lubridate package.
master_df$date<- ymd(master_df$date)
master_df$Start.date<- ymd_hms(master_df$Start.date)
master_df$End.date<- ymd_hms(master_df$End.date)
head(master_df$date)
## [1] "2015-11-02" "2015-05-27" "2015-06-23" "2015-05-19" "2015-11-13"
## [6] "2015-08-29"
head(master_df$Start.date)
## [1] "2015-11-02 17:39:00 UTC" "2015-05-27 17:33:00 UTC"
## [3] "2015-06-23 18:09:00 UTC" "2015-05-19 19:00:00 UTC"
## [5] "2015-11-13 11:47:00 UTC" "2015-08-29 14:19:00 UTC"
head(master_df$End.date)
## [1] "2015-11-02 18:07:00 UTC" "2015-05-27 17:42:00 UTC"
## [3] "2015-06-23 18:15:00 UTC" "2015-05-19 19:02:00 UTC"
## [5] "2015-11-13 11:50:00 UTC" "2015-08-29 14:50:00 UTC"
Changed precipitation variable to type numeric.
master_df$PrecipitationIn<- as.numeric(as.character(master_df$PrecipitationIn))
## Warning: NAs introduced by coercion
Created a new hour variable.
master_df$hour<- hour(master_df$Start.date)
Added a weekday variable (day name).
master_df$weekday<- weekdays(master_df$date)
Created a binary weekend variable.
master_df$weekend<- ifelse(master_df$weekday=="Saturday"|master_df$weekday=="Sunday",1,0)
Created a binary rush hour variable.
master_df$rushhour<- ifelse((master_df$hour>=7 & master_df$hour<=9) | (master_df$hour>=16 & master_df$hour<=19),1,0)
Created a binary holiday variable.
holiday_dates <- as.Date(c('2015-01-01','2015-01-19','2015-02-16','2015-04-16','2015-04-17','2015-05-22','2015-05-25','2015-05-26','2015-07-02','2015-07-03','2015-07-06','2015-09-04','2015-09-07','2015-09-08','2015-10-12','2015-11-11','2015-11-26','2015-11-27','2015-12-24','2015-12-25','2015-12-31'))
master_df$holiday <- ifelse(master_df$date %in% holiday_dates,1,0)
Combined the holiday and weekend variables for a new binary variable.
master_df$weekend_holiday<- ifelse(master_df$weekend==1 | master_df$holiday == 1,1,0)
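One caveat with the weekend flag: `weekdays()` returns day names in the session's locale, so the comparison with "Saturday" and "Sunday" only works in an English locale. The sketch below shows a locale-independent way to derive the same hour, weekend, and rush hour flags in base R. The timestamps and the variable names (`start_times`, `ride_hour`, `is_weekend`, `is_rushhour`) are hypothetical and only for illustration.

```r
# Hypothetical ride start times (2015-11-02 was a Monday)
start_times <- as.POSIXct(
  c("2015-11-02 08:15:00", "2015-11-07 13:00:00", "2015-11-08 17:30:00"),
  tz = "UTC"
)
lt <- as.POSIXlt(start_times, tz = "UTC")

ride_hour <- lt$hour                    # hour of day, 0-23
is_weekend <- lt$wday %in% c(0, 6)      # wday: 0 = Sunday, 6 = Saturday
is_rushhour <- (ride_hour >= 7 & ride_hour <= 9) |
  (ride_hour >= 16 & ride_hour <= 19)   # same hour windows as in the text

data.frame(start_times, ride_hour,
           weekend = as.integer(is_weekend),
           rushhour = as.integer(is_rushhour))
```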
Revalued subscription type, as certain datasets used “Member” and other datasets used “Registered” to signify a “subscriber”. Therefore, we combined these values.
master_df$Subscription.Type<-revalue(master_df$Subscription.Type,c('Member'='Registered'))
detach(package:plyr)
Created feels like temperature variable.
master_df$Mean.Humidity<-master_df$Mean.Humidity/100
master_df$feellike<-0.363445176+
0.98862246*(master_df$Mean.TemperatureF)+
4.777114035*(master_df$Mean.Humidity)+
-0.114037667*(master_df$Mean.TemperatureF*master_df$Mean.Humidity)+
-0.000850208*(master_df$Mean.TemperatureF^2)+
-0.020716198*(master_df$Mean.Humidity^2)+
0.000687678*((master_df$Mean.TemperatureF^2)*master_df$Mean.Humidity)+
0.000274954*(master_df$Mean.TemperatureF*(master_df$Mean.Humidity^2))+
0*((master_df$Mean.TemperatureF^2)*(master_df$Mean.Humidity^2))
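For reuse and testing, the same polynomial can be wrapped in a function. This is only a sketch: the function name `feels_like` and the example inputs are hypothetical, and the final term is omitted because its coefficient is zero. It keeps the humidity scale used above (percent divided by 100). The text does not name the source of the coefficients, so it is worth confirming against that source whether they expect humidity as a fraction or as a percentage.

```r
# Polynomial from the text, wrapped as a function.
# temp_f:   temperature in degrees Fahrenheit
# humidity: relative humidity on the scale used in the text (percent / 100)
feels_like <- function(temp_f, humidity) {
  0.363445176 +
    0.98862246 * temp_f +
    4.777114035 * humidity +
    -0.114037667 * (temp_f * humidity) +
    -0.000850208 * temp_f^2 +
    -0.020716198 * humidity^2 +
    0.000687678 * (temp_f^2 * humidity) +
    0.000274954 * (temp_f * humidity^2)
}

# Hypothetical inputs: a mild dry day and a hot humid day
feels_like(temp_f = c(60, 90), humidity = c(0.40, 0.80))
```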
Created season variable.
getSeason <- function(DATES) {
WS <- as.Date("2012-12-15", format="%Y-%m-%d") # Winter Solstice
SE <- as.Date("2012-03-15", format="%Y-%m-%d") # Spring Equinox
SS <- as.Date("2012-06-15", format="%Y-%m-%d") # Summer Solstice
FE <- as.Date("2012-09-15", format="%Y-%m-%d") # Fall Equinox
d <- as.Date(strftime(DATES, format="2012-%m-%d"))
ifelse (d>=WS|d<SE, "Winter", ifelse (d>=SE&d<SS,"Spring", ifelse(d>=SS&d<FE,"Summer","Fall")))
}
master_df$season<-getSeason(master_df$date)
master_df$season<-as.factor(master_df$season)
summary(master_df$season)
## Fall Spring Summer Winter
## 799058 957814 1048759 387277
Created a variable for adverse weather.
master_df$AdverseWeather <- ifelse(master_df$Events == "","False","True")
master_df$AdverseWeather <- as.factor(master_df$AdverseWeather)
summary(master_df$AdverseWeather)
## False True
## 1972462 1220446
Created a variable for beautiful weather.
master_df$BeautifulWeather <- ifelse(master_df$Events == "" & master_df$Mean.TemperatureF >= 50 & master_df$Mean.TemperatureF <= 85,"True","False")
master_df$BeautifulWeather <- as.factor(master_df$BeautifulWeather)
summary(master_df$BeautifulWeather)
## False True
## 1578648 1614260
Removed the leading whitespace from the city variable.
levels(master_df$City)
## [1] " Alexandria" " Arlington" " Bethesda" " Chevy Chase"
## [5] " Derwood" " McLean" " Potomac" " Reston"
## [9] " Rockville" " Silver Spring" " Takoma Park" " Tysons"
## [13] " Vienna" " Washington"
levels(master_df$City)<- trimws(levels(master_df$City))
levels(master_df$City)
## [1] "Alexandria" "Arlington" "Bethesda" "Chevy Chase"
## [5] "Derwood" "McLean" "Potomac" "Reston"
## [9] "Rockville" "Silver Spring" "Takoma Park" "Tysons"
## [13] "Vienna" "Washington"
Created a month variable.
master_df$month<- month(master_df$date)
The precipitation variable was missing quite a few values. However, we could fill in these missing values with a best guess using the event variable and the average precipitation by month.
If the weather event variable was blank, we treated it as a good-weather day and set the missing precipitation to zero. Conversely, if the weather event variable was not blank, it was likely that rain was experienced, so we filled in the missing value with the average precipitation for the respective month.
average_month_precipitation <- master_df %>%
  select(month, PrecipitationIn) %>%
  group_by(month) %>%
  summarise(avg_precipitation = mean(PrecipitationIn, na.rm = TRUE))
master_df <- merge(master_df, average_month_precipitation, by = "month")
# Missing value on a day with a weather event: monthly average;
# missing value on a day without an event: 0; otherwise keep the observed value
master_df$new_precipitation <-
  ifelse(is.na(master_df$PrecipitationIn) & master_df$Events != "",
         master_df$avg_precipitation,
         ifelse(is.na(master_df$PrecipitationIn), 0, master_df$PrecipitationIn))
# Drop variable used for calculation
master_df$avg_precipitation <- NULL
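The imputation rule can be checked on a few rows before applying it to the full dataframe. The sketch below uses a hypothetical toy table (`toy`, with invented values) and base R's `ave()` in place of the merge, purely for illustration:

```r
# Toy daily weather records (hypothetical values)
toy <- data.frame(
  month = c(1, 1, 1, 2, 2),
  PrecipitationIn = c(0.20, NA, NA, 0.10, NA),
  Events = c("Rain", "Rain", "", "Snow", ""),
  stringsAsFactors = FALSE
)

# Monthly mean of the observed precipitation values, repeated on every row
avg_precip <- ave(toy$PrecipitationIn, toy$month,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Missing + weather event -> monthly average; missing + no event -> 0;
# otherwise keep the observed value
toy$new_precipitation <- ifelse(
  is.na(toy$PrecipitationIn) & toy$Events != "", avg_precip,
  ifelse(is.na(toy$PrecipitationIn), 0, toy$PrecipitationIn)
)
toy
```

In this toy table the second row (missing, with a rain event) receives the January average of 0.20, while the third and fifth rows (missing, no event) receive 0.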
Factorized the weekday, weekend, holiday, rushhour, weekend_holiday, cloud cover, hour, and zip variables.
master_df$weekday<- as.factor(master_df$weekday)
master_df$weekday<- factor(master_df$weekday, levels = c("Monday", "Tuesday", "Wednesday","Thursday","Friday","Saturday","Sunday"))
master_df$weekend<- as.factor(master_df$weekend)
master_df$holiday<- as.factor(master_df$holiday)
master_df$rushhour<- as.factor(master_df$rushhour)
master_df$weekend_holiday<- as.factor(master_df$weekend_holiday)
master_df$CloudCover<- as.factor(master_df$CloudCover)
master_df$hour<- as.factor(master_df$hour)
master_df$Zip<- as.factor(master_df$Zip)
The final list of variables and the dimensions of the dataframe.
dim(master_df)
## [1] 3192908 49
names(master_df)
## [1] "month" "Start.station"
## [3] "date" "Total.duration..ms."
## [5] "Start.date" "End.date"
## [7] "End.station" "Bike.number"
## [9] "Subscription.Type" "Max.TemperatureF"
## [11] "Mean.TemperatureF" "Min.TemperatureF"
## [13] "Max.Dew.PointF" "MeanDew.PointF"
## [15] "Min.DewpointF" "Max.Humidity"
## [17] "Mean.Humidity" "Min.Humidity"
## [19] "Max.Sea.Level.PressureIn" "Mean.Sea.Level.PressureIn"
## [21] "Min.Sea.Level.PressureIn" "Max.VisibilityMiles"
## [23] "Mean.VisibilityMiles" "Min.VisibilityMiles"
## [25] "Max.Wind.SpeedMPH" "Mean.Wind.SpeedMPH"
## [27] "Max.Gust.SpeedMPH" "PrecipitationIn"
## [29] "CloudCover" "Events"
## [31] "WindDirDegrees" "LATITUDE"
## [33] "LONGITUDE" "Address"
## [35] "City" "State"
## [37] "Zip" "Country"
## [39] "hour" "weekday"
## [41] "weekend" "rushhour"
## [43] "holiday" "weekend_holiday"
## [45] "feellike" "season"
## [47] "AdverseWeather" "BeautifulWeather"
## [49] "new_precipitation"
Saved a copy of the master dataset.
write.csv(master_df,"master_df.csv")
\section{2.3 Exploratory Data Analysis}
The data exploration stage focused on visualizing the relationships between variables and exploring patterns within the dataset.
The following sequence of commentary and code showcases the EDA that was conducted.
# Fine tune master_df before creating EDAs
master_df %>%
mutate(
date = ymd(date),
weekday = factor(weekday,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")), season = factor(season, levels = c("Spring", "Summer", "Fall", "Winter")),
hour = factor(hour, levels = 0:23),
duration.min = round(Total.duration..ms. / 60000, digits = 2)
) %>%
dplyr::select(Start.station, date, duration.min, End.station, Subscription.Type, CloudCover, Events, LATITUDE, LONGITUDE, Address, City, Zip, hour, weekday, weekend, rushhour, holiday, season, AdverseWeather, BeautifulWeather, weekend_holiday) -> map_df
We first examine the distributions of the two main continuous variables in the dataset: total.rides and avg.duration. The levels of five categorical variables (Subscription.Type, weekend_holiday, rushhour, season, and AdverseWeather) are used as fill colors to give a high-level comparison of the distributions between groups.
map_df %>%
group_by(date, hour, Subscription.Type) %>%
summarise(
total.rides = n(),
avg.duration = mean(duration.min),
weekend_holiday = first(weekend_holiday),
weekday = first(weekday),
rushhour = first(rushhour),
season = first(season),
AdverseWeather = first(AdverseWeather)
) -> day_hour_rides
# Create distribution histograms
g1 <- ggplot(data=day_hour_rides)
g1 + geom_histogram(mapping = aes(total.rides, fill = Subscription.Type), binwidth = 0.5) -> g2
g1 + geom_histogram(mapping = aes(log(avg.duration), fill = Subscription.Type), bins = 100) -> g3
g1 + geom_histogram(mapping = aes(total.rides, fill = weekend_holiday), binwidth = 0.5) -> g4
g1 + geom_histogram(mapping = aes(log(avg.duration), fill = weekend_holiday), bins = 100) -> g5
g1 + geom_histogram(mapping = aes(total.rides, fill = as.factor(rushhour)), binwidth = 0.5) -> g6
g1 + geom_histogram(mapping = aes(log(avg.duration), fill = as.factor(rushhour)), bins = 100) -> g7
g1 + geom_histogram(mapping = aes(total.rides, fill = as.factor(season)), binwidth = 0.5) -> g8
g1 + geom_histogram(mapping = aes(log(avg.duration), fill = as.factor(season)), bins = 100) -> g9
g1 + geom_histogram(mapping = aes(total.rides, fill = as.factor(AdverseWeather)), binwidth = 0.5) -> g10
g1 + geom_histogram(mapping = aes(log(avg.duration), fill = as.factor(AdverseWeather)), bins = 100) -> g11
# Display the histograms
plot_grid(g2, g3, nrow = 2, rel_heights = c(1/2, 1/2))
plot_grid(g4, g5, nrow = 2, rel_heights = c(1/2, 1/2))
plot_grid(g6, g7, nrow = 2, rel_heights = c(1/2, 1/2))
plot_grid(g8, g9, nrow = 2, rel_heights = c(1/2, 1/2))
plot_grid(g10, g11, nrow = 2, rel_heights = c(1/2, 1/2))
Our first impression is that the distribution of total.rides is right-skewed, while the distribution of avg.duration has two modes.
More specifically, the avg.duration distribution by Subscription.Type indicates that registered bikers contribute to the lower-duration mode while casual bikers contribute to the higher mode. Casual bikers also account for far fewer total.rides than registered bikers. In the distribution by rushhour, commuting-hour rides dominate the hours with higher total.rides counts, and they contribute more to the lower avg.duration mode. Another interesting finding, from the distribution by season, is that winter has many more short-duration rides than other seasons, while spring and summer have more long-duration rides among casual riders.
The above analysis indicates that time-related factors have a strong impact on the dependent variables. As a next step, we create heatmaps of hour of the day by day of the week to further explore the patterns.
# Create a subset just for the time heatmap
day_hour_rides %>%
ungroup() %>%
select(hour, weekday, total.rides, avg.duration) %>%
mutate(total_duration = total.rides * avg.duration,
hour = factor(hour, levels = (0:23))) %>%
group_by(hour, weekday) %>%
summarise(count.rides = sum(total.rides), total.duration = sum(total_duration)) -> df.1
# Create time based heatmaps
g10 <- ggplot(data=df.1, aes(x=hour, y=weekday, fill=count.rides)) +
geom_tile(color="white", size=0.1)+ coord_equal() +
labs(x=NULL, y=NULL, title="Count of Rides Per Weekday & Hour of Day") +
theme_tufte(base_family="Calibri") + theme(plot.title=element_text(hjust=0.5, size = 10)) +
theme(axis.ticks=element_blank()) + theme(axis.text=element_text(size=7)) + theme(legend.position="none") +
scale_fill_gradient(low = "white", high = "steelblue")
g11 <- ggplot(data=df.1, aes(x=hour, y=weekday, fill=total.duration)) +
geom_tile(color="white", size=0.1)+ coord_equal() +
labs(x=NULL, y=NULL, title="Total Duration Per Weekday & Hour of Day") +
theme_tufte(base_family="Calibri") + theme(plot.title=element_text(hjust=0.5, size = 10)) + theme(legend.position="none") +
theme(axis.ticks=element_blank()) + theme(axis.text=element_text(size=7)) +
scale_fill_gradient(low = "white", high = "firebrick")
g12 <- ggplot(data=df.1, aes(x=hour, y=weekday, fill=total.duration/count.rides)) +
geom_tile(color="white", size=0.1)+ coord_equal() +
labs(x=NULL, y=NULL, title="Average Duration Per Weekday & Hour of Day") +
theme_tufte(base_family="Calibri") + theme(plot.title=element_text(hjust=0.5, size = 10)) + theme(legend.position="none") +
theme(axis.ticks=element_blank()) + theme(axis.text=element_text(size=7)) +
scale_fill_gradient(low = "white", high = "springgreen3")
plot_grid(g10, g12, nrow = 2, rel_heights = c(1/2, 1/2))
The hour-weekday heatmaps show some interesting patterns. More rides take place during rush hours on workdays, while on weekends total.rides is spread fairly evenly across the daytime. The avg.duration of rides appears to be longer during the daytime on weekends.
With a general understanding of the data in hand, we move on to explore the geospatial distribution of total.rides across the DC metro area. First, let us plot the bike stations.
# Create station list with coordinates, total count of rides, and total duration of rides
map.stations <- map_df %>%
group_by(Start.station) %>%
summarise(total.rides = n(),
avg.duration = mean(duration.min),
subscriber.percentage = mean(Subscription.Type == "Registered"),
lat = first(LATITUDE),
lon = first(LONGITUDE)
)
head(map.stations)
## # A tibble: 6 × 6
## Start.station total.rides avg.duration
## <fctr> <int> <dbl>
## 1 10th & E St NW 13611 25.61816
## 2 10th & Florida Ave NW 8316 12.16949
## 3 10th & Monroe St NE 3916 15.95705
## 4 10th & U St NW 13463 12.52403
## 5 10th St & Constitution Ave NW 19128 28.62108
## 6 11th & F St NW 13898 20.94567
## # ... with 3 more variables: subscriber.percentage <dbl>, lat <dbl>,
## # lon <dbl>
# Plotly not working, skip
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showland = TRUE,
landcolor = toRGB("gray85"),
subunitwidth = 1,
countrywidth = 1,
subunitcolor = toRGB("white"),
countrycolor = toRGB("white")
)
p <- plot_geo(map.stations, locationmode = 'city', sizes = c(1, 250)) %>%
add_markers(
x = ~lon, y = ~lat, size = ~total.rides, color = ~avg.duration, hoverinfo = "text",
text = ~paste(map.stations$Start.station, "<br />",
"Total Rides: ", map.stations$total.rides, "<br />",
"Average Duration: ", map.stations$avg.duration, " mins",
"Percentage of Subscribers: ", map.stations$subscriber.percentage)
) %>%
layout(title = '2015 Capital Bike Share Stations', geo = g)
Below we can see the locations of all the bike share stations across the DMV (DC, Maryland, Virginia) area, with circle size representing total.rides and color representing avg.duration. The stations are spread widely across the DMV area, including outskirts such as Alexandria, VA, Bethesda, MD, and Silver Spring, MD.
# download basic map layers for plotting
base.map <- qmap("Washington DC", zoom = 12, source= "google", maptype="roadmap", color = "bw", crop=FALSE, legend='topleft')
base.map.1 <- qmap("Washington DC", zoom = 13, source= "google", maptype="roadmap", color = "bw", crop=FALSE, legend='topleft')
base.map.2 <- qmap("Washington DC", zoom = 14, source= "google", maptype="roadmap", color = "bw", crop=FALSE, legend='topleft')
base.map + geom_point(aes(x = lon, y = lat, size=total.rides, color=avg.duration), data = map.stations,
alpha = .5)+ scale_size(range = c(1, 5)) + scale_colour_gradient(low = "steelblue", high = "springgreen")
base.map.1 + geom_point(aes(x = lon, y = lat, size=total.rides, color=avg.duration), data = map.stations,
alpha = .5) + scale_size(range = c(1, 5)) + scale_colour_gradient(low = "steelblue", high = "springgreen")
base.map.2 + geom_point(aes(x = lon, y = lat, size=total.rides, color=avg.duration), data = map.stations,
alpha = .5) + scale_size(range = c(1, 10)) + scale_colour_gradient(low = "steelblue", high = "springgreen")
But how is the actual count of total.rides distributed across the area? Does it follow the bike station locations? To find out, we create a heatmap based on the density of total.rides on the map. The graphs below indicate that total.rides is far more concentrated than the bike stations themselves, with most rides starting in the heart of DC, around Dupont Circle, Logan Circle, the National Mall, Metro Center, Gallery Place, the World Bank, and the Lincoln Memorial.
# Create a ride data set with location and ride, will also keep sliceability with other factors
# Adjust factor level names for better display in faceted visuals
map_df %>%
mutate(lon = LONGITUDE, lat = LATITUDE) %>%
select(Subscription.Type, Events, lat, lon, hour,
weekday, weekend, rushhour, holiday, season, AdverseWeather, BeautifulWeather, weekend_holiday) %>%
mutate(
hour = as.numeric(as.character(hour)), # hour is a factor; convert its labels (0-23), not its level codes (1-24)
AdverseWeather = as.factor(if_else(AdverseWeather=="True", "Adverse: Yes", "Adverse: No")),
BeautifulWeather = as.factor(if_else(BeautifulWeather == "True", "Beautiful: Yes", "Beautiful: No")),
holiday = as.factor(if_else(holiday == "1", "Holiday: Yes", "Holiday: No")),
weekend = as.factor(if_else(weekend == "1", "Weekend: Yes", "Weekend: No")),
rushhour = as.factor(if_else(rushhour == "1", "Rush Hour: Yes", "Rush Hour: No")),
weekend_holiday = as.factor(if_else(weekend_holiday == "1", "Leisure Day: Yes", "Leisure Day: No")),
time_of_day = factor(if_else(hour>4 & hour < 13, "Morning",
if_else(hour>12 & hour < 19, "Afternoon",
if_else(hour >16 & hour <= 23, "Night", "Late Night"))),
levels = c("Morning", "Afternoon", "Night", "Late Night")),
hour = factor(hour, levels = 0:23)) -> ride_df
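Because hour is stored as a factor at this point, converting it to a number needs some care: `as.numeric()` applied directly to a factor returns the internal level codes rather than the labels. The short sketch below, using a hypothetical three-value factor `hour_f`, shows the difference; this matters for the time_of_day thresholds, which are written in terms of clock hours 0 to 23.

```r
# hour stored as a factor with labels 0-23, as in map_df
hour_f <- factor(c(0, 7, 23), levels = 0:23)

as.numeric(hour_f)                # level codes: 1, 8, 24
as.numeric(as.character(hour_f))  # original labels: 0, 7, 23
```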
# Create ride density maps
base.map + geom_density2d(data = ride_df[sample(1:nrow(ride_df), 10000),],
aes(x = lon, y = lat), size = 0.4) + stat_density2d(data = ride_df[sample(1:nrow(ride_df), 10000),],
aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), size = 1,
bins = 5, geom = "polygon", contour = TRUE) + scale_fill_gradient(low = "springgreen", high = "red") +
scale_alpha(range = c(0, 0.3), guide = FALSE)
base.map.1 + geom_density2d(data = ride_df[sample(1:nrow(ride_df), 10000),],
aes(x = lon, y = lat), size = 0.4) + stat_density2d(data = ride_df[sample(1:nrow(ride_df), 10000),],
aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), size = 2,
bins = 8, geom = "polygon", contour = TRUE) + scale_fill_gradient(low = "springgreen", high = "red") +
scale_alpha(range = c(0, 0.3), guide = FALSE)
base.map.2 + geom_density2d(data = ride_df[sample(1:nrow(ride_df), 10000),],
aes(x = lon, y = lat), size = 0.5) + stat_density2d(data = ride_df[sample(1:nrow(ride_df), 10000),],
aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), size = 3,
bins = 15, geom = "polygon", contour = TRUE) + scale_fill_gradient(low = "springgreen", high = "red") +
scale_alpha(range = c(0, 0.3), guide = FALSE)
Now that we have a general idea of where most rides happen in DC, our next step is to slice the ridership data by the time and weather factors we generated and compare the patterns. We want to see whether the popularity of the stations changes under different time and weather conditions.
# Create a random subsample of 15000 rides
ride_df.sample <- ride_df[sample(1:nrow(ride_df), 15000),]
# Ride frequency heatmap by seasons
dc.map.3 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~season, nrow = 1) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution by Seasons") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
The first graph shows the distribution of rides in each season of 2015. In spring and summer, both the Lincoln Memorial and the National Mall see more rides than at other times of the year. During winter, however, more people seem to take bike rides around Logan Circle, Foggy Bottom, and Metro Center, i.e. the inner center of the District.
# Ride frequency heatmap by time of day
dc.map.3 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~time_of_day, nrow = 1) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution by Time of Day") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
A similar comparison by time of day shows that people take more rides in central to northeastern DC in the morning and more in central to southwestern DC in the afternoon. At night, bikers mostly start their rides around Dupont Circle, Logan Circle, Metro Center, and Gallery Place. Few people start rides late at night, as expected, but those rides are relatively more common in central to northwestern DC. People's daily routines appear to drive this pattern, considering that these areas correspond to the residential, working, and entertainment/event areas of DC.
# Ride frequency heatmap by rush hour
dc.map.3 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~rushhour) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution - Rush Hour?") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
Since time has a clear effect on total.rides and bikes can be a useful tool for commuting, we specifically compare the distribution of rides during rush hours against other times of the day. In the above graph, more people take bike rides near Metro Center, Gallery Place, and Capitol Hill during rush hours, while more people take rides near the Lincoln Memorial and the National Mall during non-rush hours. This is consistent with Metro Center, Gallery Place, and Capitol Hill being places where many people work, while the Lincoln Memorial and the National Mall are popular tourist sites.
# Ride frequency heatmap by weekend/holiday
dc.map.3 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~weekend_holiday + BeautifulWeather, nrow = 1) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution - Leisure Days X Good Weather") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
Since the Lincoln Memorial and the National Mall are so popular in non-rush hours, we check whether leisure time shows a different pattern in the distribution of total.rides. Comparing the left two graphs in the above chart, the distribution of ridership is clearly more dispersed on leisure days with good weather: riders start their rides from many different stations across the District. Interestingly, the second graph from the left shows that bikers still mostly ride in central DC on working days despite the good weather. Commuting really seems to be a major function of the shared bikes!
Since commuting seems to be such a big factor in the distribution of rides, we dig a bit deeper into the type of subscription for each ride. Since bike share subscribers are more likely to use bikes to commute, will we see a clear difference between casual and registered bikers?
# Ride frequency heatmap by Subscription Type
dc.map.2 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~Subscription.Type + rushhour, nrow = 1) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution by Subscription Type & Rush Hour") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
The above graph shows that casual bikers take more rides around the tourist attractions in DC, whether or not it is rush hour. For subscribers, however, the distribution of rides is surprisingly even, again regardless of rush hour. Considering the nature of commuting, this makes sense: people who ride based on their daily commuting needs use bikes both to get to work and to go home. The green area in the right two graphs effectively shows the routine start stations for registered users!
# Ride frequency heatmap by adverse weather
dc.map.2 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~AdverseWeather) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution - Adverse Weather?") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
A quick comparison of adverse against non-adverse weather shows little difference in ridership. This might be due to the nature of our integrated weather data: the weather information consists of daily mean values, which makes it hard for the slicers to differentiate the ridership distribution at a finer grain such as the hour.
# Ride frequency heatmap by rush hour and adverse weather
dc.map.3 + stat_density2d(aes(x=lon, y=lat, fill=..level.., alpha=..level..),
bins=7, geom="polygon", data=ride_df.sample) +
scale_fill_gradient(low="springgreen", high="tomato") + scale_alpha(range = c(0.1, 0.6), guide = FALSE) +
facet_wrap(~AdverseWeather + rushhour, nrow = 1) +
guides(fill=guide_legend(title="ride\nfrequency")) +
ggtitle("Ride Distribution - Bad Weather X Rush Hour") +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.text = element_blank(),
plot.title = element_text(color="black", size=16, hjust=0))
Again, in the graph above, we observe a bigger difference from rush hour than from the weather. This seems to be related to the same limitation of the daily weather variables.
\section{3. Modeling}
In total, three different models were tested: Multiple Linear Regression, Regression Tree, and Random Forest. Prediction performance was assessed using the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
Before creating the models, the final dataset was grouped by date and hour (together with subscription type, city, and the daily weather variables), since this was the level of ridership that we wanted to predict. We also dropped variables that would not be used for modeling. Lastly, we split the final dataset into train and test dataframes based on a 70/30 random sampling split.
model_df <- master_df %>%
group_by(date,hour,Subscription.Type,Mean.TemperatureF,MeanDew.PointF,Mean.Humidity,Mean.Sea.Level.PressureIn,Mean.VisibilityMiles,Mean.Wind.SpeedMPH,new_precipitation,CloudCover,Events,City,weekday,weekend,rushhour,weekend_holiday,feellike,season,AdverseWeather,BeautifulWeather) %>%
summarise(total_rides = length(date))
dim(model_df)
## [1] 67966 22
names(model_df)
## [1] "date" "hour"
## [3] "Subscription.Type" "Mean.TemperatureF"
## [5] "MeanDew.PointF" "Mean.Humidity"
## [7] "Mean.Sea.Level.PressureIn" "Mean.VisibilityMiles"
## [9] "Mean.Wind.SpeedMPH" "new_precipitation"
## [11] "CloudCover" "Events"
## [13] "City" "weekday"
## [15] "weekend" "rushhour"
## [17] "weekend_holiday" "feellike"
## [19] "season" "AdverseWeather"
## [21] "BeautifulWeather" "total_rides"
smp_size <- floor(0.7 * nrow(model_df))
set.seed(700)
train_ind <- sample(seq_len(nrow(model_df)), size = smp_size)
linearreg_train <- model_df[train_ind, ]
linearreg_test <- model_df[-train_ind, ]
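The two error metrics named at the start of this section can be written as small helper functions. This is a minimal sketch: the function names `mae` and `rmse` and the example vectors are hypothetical, and the original analysis may have computed the metrics differently.

```r
# Mean Absolute Error: average size of the errors, in rides per hour
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Root Mean Squared Error: penalizes large errors more heavily than MAE
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical hourly ride counts and model predictions
actual <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 35)

mae(actual, predicted)
rmse(actual, predicted)
```

With a fitted model, `predicted` would typically come from `predict()` on the test dataframe and `actual` from its total_rides column.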
\section{Model 1 - Multiple Linear Regression Model}
A collinearity test was conducted for the numeric variables.
num_vars <- c("Mean.TemperatureF","MeanDew.PointF","Mean.Humidity","Mean.Sea.Level.PressureIn","Mean.VisibilityMiles","Mean.Wind.SpeedMPH", "new_precipitation")
collinear_test_df <- linearreg_train[num_vars]
plot(collinear_test_df)
qplot(x=Var1, y=Var2, data = melt(cor(collinear_test_df)), fill=value, geom = "tile") +
labs(x = "Var1", y = "Var2") +
ggtitle("Correlation Coefficient Matrix")
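Besides reading the plots, highly correlated pairs can be listed directly from the correlation matrix. The sketch below uses the built-in mtcars data as a stand-in for the numeric weather variables. The 0.8 cutoff and the object names (`num_df`, `cor_mat`, `high`) are assumptions for illustration, not values from the original analysis.

```r
# Built-in mtcars data stands in for the numeric weather variables
num_df <- mtcars[, c("mpg", "disp", "hp", "wt")]
cor_mat <- cor(num_df)

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff
cutoff <- 0.8  # assumed threshold, not taken from the original analysis
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
```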
Mean.TemperatureF and MeanDew.PointF showed a high correlation, so MeanDew.PointF was dropped. Date was also dropped, as including the variable would have lead to overfitting and also created a factor with far too many levels to be included in the model.
With the list of variables finalized, three model selection techniques were tested: Adjusted R Squared, AIC, and BIC. These techniques penalize the inclusion of each additional variable differently, so we wanted to understand how the choice of criterion would affect each model's predictions.
linearreg_train$MeanDew.PointF <- NULL
linearreg_train$date <- NULL
names(linearreg_train)
## [1] "hour" "Subscription.Type"
## [3] "Mean.TemperatureF" "Mean.Humidity"
## [5] "Mean.Sea.Level.PressureIn" "Mean.VisibilityMiles"
## [7] "Mean.Wind.SpeedMPH" "new_precipitation"
## [9] "CloudCover" "Events"
## [11] "City" "weekday"
## [13] "weekend" "rushhour"
## [15] "weekend_holiday" "feellike"
## [17] "season" "AdverseWeather"
## [19] "BeautifulWeather" "total_rides"
m_full_linear <- lm(total_rides ~ ., data = na.omit(linearreg_train))
summary(m_full_linear)
##
## Call:
## lm(formula = total_rides ~ ., data = na.omit(linearreg_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.91 -48.42 -14.57 30.25 914.53
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -420.2290 97.9117 -4.292 1.78e-05 ***
## hour1 -28.1164 4.6378 -6.062 1.35e-09 ***
## hour2 -40.1467 4.9679 -8.081 6.59e-16 ***
## hour3 -71.0261 5.5359 -12.830 < 2e-16 ***
## hour4 -66.6629 5.3610 -12.435 < 2e-16 ***
## hour5 -4.7038 4.4733 -1.052 0.293025
## hour6 37.8042 4.0588 9.314 < 2e-16 ***
## hour7 88.0880 3.8490 22.886 < 2e-16 ***
## hour8 116.7852 3.7463 31.173 < 2e-16 ***
## hour9 87.1247 3.7573 23.188 < 2e-16 ***
## hour10 77.7497 3.7760 20.591 < 2e-16 ***
## hour11 87.3201 3.7674 23.178 < 2e-16 ***
## hour12 98.0514 3.7716 25.997 < 2e-16 ***
## hour13 97.4776 3.7658 25.885 < 2e-16 ***
## hour14 94.7998 3.7841 25.052 < 2e-16 ***
## hour15 100.3481 3.7663 26.643 < 2e-16 ***
## hour16 112.7898 3.7298 30.240 < 2e-16 ***
## hour17 138.8497 3.7008 37.519 < 2e-16 ***
## hour18 129.7897 3.7188 34.901 < 2e-16 ***
## hour19 103.7854 3.7881 27.398 < 2e-16 ***
## hour20 80.0564 3.8680 20.697 < 2e-16 ***
## hour21 61.3715 3.9424 15.567 < 2e-16 ***
## hour22 43.0206 4.0359 10.659 < 2e-16 ***
## hour23 22.7684 4.1510 5.485 4.16e-08 ***
## Subscription.TypeRegistered 79.5777 1.0847 73.364 < 2e-16 ***
## Mean.TemperatureF -21.6701 3.0046 -7.212 5.60e-13 ***
## Mean.Humidity -42.2389 7.5860 -5.568 2.59e-08 ***
## Mean.Sea.Level.PressureIn 3.4181 3.0947 1.105 0.269373
## Mean.VisibilityMiles 3.0808 0.6550 4.703 2.57e-06 ***
## Mean.Wind.SpeedMPH -0.4087 0.1781 -2.295 0.021727 *
## new_precipitation -10.9639 2.6156 -4.192 2.77e-05 ***
## CloudCover1 -5.0851 4.5846 -1.109 0.267364
## CloudCover2 -2.4634 4.5040 -0.547 0.584419
## CloudCover3 2.3624 4.2291 0.559 0.576427
## CloudCover4 0.5633 4.3212 0.130 0.896284
## CloudCover5 2.4764 4.2147 0.588 0.556821
## CloudCover6 3.6226 4.2587 0.851 0.394973
## CloudCover7 1.4302 4.2217 0.339 0.734786
## CloudCover8 -4.4144 4.4680 -0.988 0.323163
## EventsFog 8.4081 5.9452 1.414 0.157290
## EventsFog-Rain 8.5135 4.1056 2.074 0.038118 *
## EventsFog-Rain-Snow 9.0902 9.5589 0.951 0.341628
## EventsFog-Rain-Thunderstorm 33.6352 10.7748 3.122 0.001800 **
## EventsFog-Snow -1.2149 12.8963 -0.094 0.924944
## EventsRain 1.5175 2.3670 0.641 0.521466
## EventsRain-Hail-Thunderstorm 10.5665 7.9060 1.337 0.181391
## EventsRain-Snow -4.5116 4.3509 -1.037 0.299769
## EventsRain-Thunderstorm 4.9787 3.1364 1.587 0.112430
## EventsSnow 1.8704 3.9680 0.471 0.637368
## EventsThunderstorm -1.1768 6.5843 -0.179 0.858158
## CityArlington 32.9890 1.6346 20.182 < 2e-16 ***
## CityBethesda -10.0073 1.8886 -5.299 1.17e-07 ***
## CityChevy Chase -25.8668 3.1785 -8.138 4.14e-16 ***
## CityDerwood -52.9897 5.6771 -9.334 < 2e-16 ***
## CityRockville -24.6696 2.4590 -10.032 < 2e-16 ***
## CitySilver Spring -26.1787 2.3017 -11.374 < 2e-16 ***
## CityTakoma Park -21.1620 2.4000 -8.818 < 2e-16 ***
## CityWashington 206.5884 1.5924 129.736 < 2e-16 ***
## weekdayTuesday -1.0615 1.9311 -0.550 0.582537
## weekdayWednesday 2.4662 1.9250 1.281 0.200151
## weekdayThursday 0.9640 1.9180 0.503 0.615254
## weekdayFriday 4.5937 1.9357 2.373 0.017643 *
## weekdaySaturday 14.3572 2.7417 5.237 1.64e-07 ***
## weekdaySunday 9.6747 2.7510 3.517 0.000437 ***
## weekend1 NA NA NA NA
## rushhour1 NA NA NA NA
## weekend_holiday1 -8.3739 2.2553 -3.713 0.000205 ***
## feellike 25.9817 3.4620 7.505 6.27e-14 ***
## seasonSpring 3.3573 1.7179 1.954 0.050675 .
## seasonSummer 0.6555 2.1158 0.310 0.756699
## seasonWinter -15.4683 2.1380 -7.235 4.73e-13 ***
## AdverseWeatherTrue NA NA NA NA
## BeautifulWeatherTrue 6.6606 2.1143 3.150 0.001633 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.81 on 39920 degrees of freedom
## Multiple R-squared: 0.4771, Adjusted R-squared: 0.4762
## F-statistic: 527.9 on 69 and 39920 DF, p-value: < 2.2e-16
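The adjusted R-squared in the summary above can be recovered from the multiple R-squared and the degrees of freedom using the standard formula. This is a small worked check; the variable names are illustrative and the inputs are the rounded values printed in the output.

```r
r2 <- 0.4771        # Multiple R-squared from the summary
df_model <- 69      # numerator df of the F-statistic (estimated slopes)
df_resid <- 39920   # residual degrees of freedom

n_obs <- df_resid + df_model + 1  # + 1 for the intercept

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / df_resid
round(adj_r2, 4)
```

With these inputs the result agrees with the reported adjusted R-squared of 0.4762. The gap to the multiple R-squared is tiny because the number of observations is very large relative to the number of coefficients.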
str(master_df)
## 'data.frame': 3192908 obs. of 49 variables:
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Start.station : Factor w/ 364 levels "10th & E St NW",..: 244 4 252 272 234 56 46 85 240 264 ...
## $ date : Date, format: "2015-01-15" "2015-01-10" ...
## $ Total.duration..ms. : num 651115 359735 508262 804681 3882809 ...
## $ Start.date : POSIXct, format: "2015-01-15 19:44:00" "2015-01-10 18:58:00" ...
## $ End.date : POSIXct, format: "2015-01-15 19:55:00" "2015-01-10 19:04:00" ...
## $ End.station : Factor w/ 364 levels "10th & E St NW",..: 348 150 299 91 234 36 48 240 26 342 ...
## $ Bike.number : Factor w/ 3582 levels "W00005","W00006",..: 2540 2931 2418 1252 2711 3169 1265 915 264 1650 ...
## $ Subscription.Type : Factor w/ 2 levels "Casual","Registered": 2 2 2 2 1 2 2 2 2 2 ...
## $ Max.TemperatureF : int 42 30 67 52 67 26 43 30 43 37 ...
## $ Mean.TemperatureF : int 37 25 55 44 55 19 33 23 37 34 ...
## $ Min.TemperatureF : int 32 19 42 36 42 12 23 15 31 30 ...
## $ Max.Dew.PointF : int 27 6 55 32 55 2 29 18 34 24 ...
## $ MeanDew.PointF : int 23 -1 45 29 45 -4 18 5 29 21 ...
## $ Min.DewpointF : int 19 -5 31 26 31 -8 3 -10 23 14 ...
## $ Max.Humidity : int 75 43 89 64 89 42 64 68 85 78 ...
## $ Mean.Humidity : num 0.6 0.32 0.68 0.53 0.68 0.36 0.47 0.49 0.67 0.65 ...
## $ Min.Humidity : int 45 21 46 41 46 29 29 30 49 51 ...
## $ Max.Sea.Level.PressureIn : num 30.2 30.6 30.1 29.9 30.1 ...
## $ Mean.Sea.Level.PressureIn: num 30.1 30.6 29.9 29.8 29.9 ...
## $ Min.Sea.Level.PressureIn : num 30 30.4 29.7 29.7 29.7 ...
## $ Max.VisibilityMiles : int 10 10 10 10 10 10 10 10 10 10 ...
## $ Mean.VisibilityMiles : int 10 10 8 10 8 10 9 10 7 9 ...
## $ Min.VisibilityMiles : int 9 10 2 10 2 10 2 10 2 4 ...
## $ Max.Wind.SpeedMPH : int 12 20 26 15 26 24 31 30 13 22 ...
## $ Mean.Wind.SpeedMPH : int 6 8 13 6 13 11 16 15 5 15 ...
## $ Max.Gust.SpeedMPH : int 23 25 39 22 39 31 41 40 16 32 ...
## $ PrecipitationIn : num 0 0 0.2 0 0.2 0 NA NA 0.65 0.01 ...
## $ CloudCover : Factor w/ 9 levels "0","1","2","3",..: 6 1 9 8 9 4 6 6 8 7 ...
## $ Events : Factor w/ 12 levels "","Fog","Fog-Rain",..: 1 1 7 1 7 1 11 11 9 11 ...
## $ WindDirDegrees : int 288 321 219 209 219 258 317 297 156 328 ...
## $ LATITUDE : num 38.9 38.9 39 38.9 38.9 ...
## $ LONGITUDE : num -77 -77 -77.1 -77 -77 ...
## $ Address : Factor w/ 424 levels "1-3 Atlantic St SW",..: 151 163 296 115 263 141 118 197 46 112 ...
## $ City : Factor w/ 14 levels "Alexandria","Arlington",..: 14 14 3 14 14 14 14 14 14 2 ...
## $ State : Factor w/ 3 levels "DC","MD","VA": 1 1 2 1 1 1 1 1 1 3 ...
## $ Zip : Factor w/ 60 levels "","20001","20002",..: 22 2 36 10 1 4 7 23 8 54 ...
## $ Country : Factor w/ 1 level " USA": 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 20 19 16 20 13 18 19 20 18 9 ...
## $ weekday : Factor w/ 7 levels "Monday","Tuesday",..: 4 6 7 7 7 4 5 3 5 2 ...
## $ weekend : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
## $ rushhour : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 2 2 2 ...
## $ holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekend_holiday : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
## $ feellike : num 36.7 25.3 52.6 42.8 52.6 ...
## $ season : Factor w/ 4 levels "Fall","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ AdverseWeather : Factor w/ 2 levels "False","True": 1 1 2 1 2 1 2 2 2 2 ...
## $ BeautifulWeather : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ new_precipitation : num 0 0 0.2 0 0.2 ...
anova(m_full_linear)
## Analysis of Variance Table
##
## Response: total_rides
## Df Sum Sq Mean Sq F value Pr(>F)
## hour 23 19225180 835877 87.3814 < 2.2e-16 ***
## Subscription.Type 1 9255166 9255166 967.5212 < 2.2e-16 ***
## Mean.TemperatureF 1 3282030 3282030 343.0985 < 2.2e-16 ***
## Mean.Humidity 1 585849 585849 61.2438 5.166e-15 ***
## Mean.Sea.Level.PressureIn 1 93292 93292 9.7526 0.0017920 **
## Mean.VisibilityMiles 1 185482 185482 19.3900 1.068e-05 ***
## Mean.Wind.SpeedMPH 1 37555 37555 3.9259 0.0475540 *
## new_precipitation 1 107181 107181 11.2045 0.0008167 ***
## CloudCover 8 112447 14056 1.4694 0.1625118
## Events 11 128281 11662 1.2191 0.2673989
## City 8 312264428 39033054 4080.4568 < 2.2e-16 ***
## weekday 6 251930 41988 4.3894 0.0001933 ***
## weekend_holiday 1 156659 156659 16.3769 5.201e-05 ***
## feellike 1 1897634 1897634 198.3758 < 2.2e-16 ***
## season 3 769646 256549 26.8192 < 2.2e-16 ***
## BeautifulWeather 1 94930 94930 9.9238 0.0016327 **
## Residuals 39920 381868889 9566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
n <- nrow(na.omit(linearreg_train))
stepAIC(m_full_linear, k = log(n))  # BIC: penalty of log(n) per parameter
## Start: AIC=367218.2
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend + rushhour + weekend_holiday + feellike +
## season + AdverseWeather + BeautifulWeather
##
##
## Step: AIC=367218.2
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend + rushhour + weekend_holiday + feellike +
## season + BeautifulWeather
##
##
## Step: AIC=367218.2
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend + weekend_holiday + feellike + season +
## BeautifulWeather
##
##
## Step: AIC=367218.2
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend_holiday + feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - Events 11 192469 382061358 367122
## - CloudCover 8 254980 382123869 367160
## - weekday 6 351598 382220487 367191
## - Mean.Sea.Level.PressureIn 1 11670 381880560 367209
## - Mean.Wind.SpeedMPH 1 50392 381919282 367213
## - BeautifulWeather 1 94930 381963819 367218
## <none> 381868889 367218
## - weekend_holiday 1 131879 382000768 367221
## - new_precipitation 1 168076 382036966 367225
## - Mean.VisibilityMiles 1 211617 382080507 367230
## - Mean.Humidity 1 296564 382165454 367239
## - Mean.TemperatureF 1 497591 382366480 367260
## - season 3 702779 382571668 367260
## - feellike 1 538786 382407675 367264
## - Subscription.Type 1 51485750 433354639 372266
## - hour 23 78275757 460144647 374431
## - City 8 314377810 696246699 391153
##
## Step: AIC=367121.8
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + City +
## weekday + weekend_holiday + feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - CloudCover 8 271415 382332773 367065
## - weekday 6 361679 382423037 367096
## - Mean.Sea.Level.PressureIn 1 9607 382070965 367112
## - Mean.Wind.SpeedMPH 1 69908 382131266 367119
## <none> 382061358 367122
## - weekend_holiday 1 119617 382180975 367124
## - new_precipitation 1 129748 382191106 367125
## - BeautifulWeather 1 141105 382202463 367126
## - Mean.VisibilityMiles 1 232143 382293501 367136
## - Mean.Humidity 1 299748 382361106 367143
## - season 3 749999 382811357 367168
## - Mean.TemperatureF 1 563989 382625347 367170
## - feellike 1 614706 382676064 367176
## - Subscription.Type 1 51452393 433513751 372164
## - hour 23 78304274 460365632 374334
## - City 8 314288739 696350097 391042
##
## Step: AIC=367065.5
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + City + weekday +
## weekend_holiday + feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - weekday 6 333120 382665894 367037
## - Mean.Sea.Level.PressureIn 1 510 382333284 367055
## <none> 382332773 367065
## - weekend_holiday 1 102626 382435399 367066
## - Mean.Wind.SpeedMPH 1 129272 382462046 367068
## - BeautifulWeather 1 155190 382487963 367071
## - new_precipitation 1 175125 382507898 367073
## - Mean.VisibilityMiles 1 271414 382604187 367083
## - Mean.Humidity 1 494402 382827175 367107
## - Mean.TemperatureF 1 529365 382862138 367110
## - feellike 1 581574 382914348 367116
## - season 3 806703 383139476 367118
## - Subscription.Type 1 51384468 433717241 372098
## - hour 23 78276818 460609591 374270
## - City 8 314077462 696410235 390961
##
## Step: AIC=367036.7
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + City + weekend_holiday +
## feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - Mean.Sea.Level.PressureIn 1 286 382666180 367026
## - weekend_holiday 1 11393 382677287 367027
## <none> 382665894 367037
## - Mean.Wind.SpeedMPH 1 132531 382798424 367040
## - BeautifulWeather 1 144342 382810236 367041
## - new_precipitation 1 152719 382818613 367042
## - Mean.VisibilityMiles 1 250936 382916830 367052
## - Mean.TemperatureF 1 531313 383197207 367082
## - feellike 1 583548 383249442 367087
## - Mean.Humidity 1 602672 383268565 367089
## - season 3 858329 383524223 367095
## - Subscription.Type 1 51323322 433989216 372059
## - hour 23 78187421 460853315 374228
## - City 8 313954562 696620456 390909
##
## Step: AIC=367026.1
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + City + weekend_holiday + feellike + season +
## BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - weekend_holiday 1 11985 382678166 367017
## <none> 382666180 367026
## - Mean.Wind.SpeedMPH 1 145238 382811419 367031
## - BeautifulWeather 1 146404 382812585 367031
## - new_precipitation 1 153609 382819790 367032
## - Mean.VisibilityMiles 1 251761 382917941 367042
## - Mean.TemperatureF 1 534751 383200931 367071
## - feellike 1 587942 383254122 367077
## - Mean.Humidity 1 603853 383270034 367079
## - season 3 880246 383546426 367086
## - Subscription.Type 1 51326559 433992740 372049
## - hour 23 78188439 460854619 374217
## - City 8 313954275 696620456 390898
##
## Step: AIC=367016.8
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + City + feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## <none> 382678166 367017
## - Mean.Wind.SpeedMPH 1 150591 382828756 367022
## - new_precipitation 1 153158 382831324 367022
## - BeautifulWeather 1 154392 382832557 367022
## - Mean.VisibilityMiles 1 256627 382934793 367033
## - Mean.TemperatureF 1 532026 383210192 367062
## - feellike 1 584923 383263089 367067
## - Mean.Humidity 1 598349 383276515 367069
## - season 3 881159 383559325 367077
## - Subscription.Type 1 51481685 434159851 372054
## - hour 23 78209352 460887517 374210
## - City 8 313962972 696641138 390889
##
## Call:
## lm(formula = total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + City + feellike + season + BeautifulWeather,
## data = na.omit(linearreg_train))
##
## Coefficients:
## (Intercept) hour1
## -299.5995 -28.0048
## hour2 hour3
## -39.9391 -71.0196
## hour4 hour5
## -66.7942 -5.3131
## hour6 hour7
## 37.4923 87.6110
## hour8 hour9
## 116.2905 86.8799
## hour10 hour11
## 77.5900 87.0558
## hour12 hour13
## 97.9970 97.2500
## hour14 hour15
## 94.7294 100.1506
## hour16 hour17
## 112.4641 138.4482
## hour18 hour19
## 129.4917 103.4653
## hour20 hour21
## 79.7041 60.9151
## hour22 hour23
## 42.7005 22.3699
## Subscription.TypeRegistered Mean.TemperatureF
## 79.3157 -20.4128
## Mean.Humidity Mean.VisibilityMiles
## -51.4137 2.6898
## Mean.Wind.SpeedMPH new_precipitation
## -0.6479 -7.7999
## CityArlington CityBethesda
## 32.8642 -10.0070
## CityChevy Chase CityDerwood
## -25.6455 -53.4749
## CityRockville CitySilver Spring
## -25.0290 -26.2259
## CityTakoma Park CityWashington
## -21.3188 206.3001
## feellike seasonSpring
## 24.6193 2.8306
## seasonSummer seasonWinter
## -1.0607 -16.2409
## BeautifulWeatherTrue
## 5.0319
stepAIC(m_full_linear, k = 2)  # AIC: penalty of 2 per parameter
## Start: AIC=366616.5
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend + rushhour + weekend_holiday + feellike +
## season + AdverseWeather + BeautifulWeather
##
##
## Step: AIC=366616.5
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend + rushhour + weekend_holiday + feellike +
## season + BeautifulWeather
##
##
## Step: AIC=366616.5
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend + weekend_holiday + feellike + season +
## BeautifulWeather
##
##
## Step: AIC=366616.5
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + Events +
## City + weekday + weekend_holiday + feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - Events 11 192469 382061358 366615
## - Mean.Sea.Level.PressureIn 1 11670 381880560 366616
## <none> 381868889 366616
## - Mean.Wind.SpeedMPH 1 50392 381919282 366620
## - BeautifulWeather 1 94930 381963819 366624
## - CloudCover 8 254980 382123869 366627
## - weekend_holiday 1 131879 382000768 366628
## - new_precipitation 1 168076 382036966 366632
## - Mean.VisibilityMiles 1 211617 382080507 366637
## - weekday 6 351598 382220487 366641
## - Mean.Humidity 1 296564 382165454 366646
## - Mean.TemperatureF 1 497591 382366480 366667
## - feellike 1 538786 382407675 366671
## - season 3 702779 382571668 366684
## - Subscription.Type 1 51485750 433354639 371672
## - hour 23 78275757 460144647 374027
## - City 8 314377810 696246699 390620
##
## Step: AIC=366614.6
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + City +
## weekday + weekend_holiday + feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## - Mean.Sea.Level.PressureIn 1 9607 382070965 366614
## <none> 382061358 366615
## - Mean.Wind.SpeedMPH 1 69908 382131266 366620
## - weekend_holiday 1 119617 382180975 366625
## - new_precipitation 1 129748 382191106 366626
## - CloudCover 8 271415 382332773 366627
## - BeautifulWeather 1 141105 382202463 366627
## - Mean.VisibilityMiles 1 232143 382293501 366637
## - weekday 6 361679 382423037 366640
## - Mean.Humidity 1 299748 382361106 366644
## - Mean.TemperatureF 1 563989 382625347 366672
## - feellike 1 614706 382676064 366677
## - season 3 749999 382811357 366687
## - Subscription.Type 1 51452393 433513751 371665
## - hour 23 78304274 460365632 374024
## - City 8 314288739 696350097 390603
##
## Step: AIC=366613.6
## total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + CloudCover + City + weekday + weekend_holiday +
## feellike + season + BeautifulWeather
##
## Df Sum of Sq RSS AIC
## <none> 382070965 366614
## - Mean.Wind.SpeedMPH 1 90856 382161821 366621
## - weekend_holiday 1 114682 382185646 366624
## - CloudCover 8 262319 382333284 366625
## - new_precipitation 1 134439 382205403 366626
## - BeautifulWeather 1 146779 382217744 366627
## - Mean.VisibilityMiles 1 236692 382307657 366636
## - weekday 6 359883 382430848 366639
## - Mean.Humidity 1 304016 382374981 366643
## - Mean.TemperatureF 1 554394 382625358 366670
## - feellike 1 605115 382676080 366675
## - season 3 787366 382858330 366690
## - Subscription.Type 1 51442821 433513785 371663
## - hour 23 78296246 460367211 374023
## - City 8 314280388 696351352 390602
##
## Call:
## lm(formula = total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + CloudCover + City + weekday + weekend_holiday +
## feellike + season + BeautifulWeather, data = na.omit(linearreg_train))
##
## Coefficients:
## (Intercept) hour1
## -312.7756 -28.1794
## hour2 hour3
## -40.1420 -70.9311
## hour4 hour5
## -66.7667 -4.7441
## hour6 hour7
## 37.7392 88.0400
## hour8 hour9
## 116.7162 87.0766
## hour10 hour11
## 77.7444 87.2682
## hour12 hour13
## 98.1012 97.4979
## hour14 hour15
## 94.7901 100.3806
## hour16 hour17
## 112.7349 138.7788
## hour18 hour19
## 129.7587 103.6887
## hour20 hour21
## 79.9594 61.2126
## hour22 hour23
## 42.9181 22.6577
## Subscription.TypeRegistered Mean.TemperatureF
## 79.5258 -21.6258
## Mean.Humidity Mean.VisibilityMiles
## -42.0769 2.6222
## Mean.Wind.SpeedMPH new_precipitation
## -0.5222 -7.5003
## CloudCover1 CloudCover2
## -5.1220 -2.5793
## CloudCover3 CloudCover4
## 2.2890 0.2518
## CloudCover5 CloudCover6
## 1.9416 3.1763
## CloudCover7 CloudCover8
## 0.8579 -4.7795
## CityArlington CityBethesda
## 32.9744 -9.9547
## CityChevy Chase CityDerwood
## -25.7636 -52.9329
## CityRockville CitySilver Spring
## -24.7075 -26.1301
## CityTakoma Park CityWashington
## -21.1539 206.5479
## weekdayTuesday weekdayWednesday
## -0.9617 3.0597
## weekdayThursday weekdayFriday
## 1.8053 5.1011
## weekdaySaturday weekdaySunday
## 14.3692 9.5128
## weekend_holiday1 feellike
## -7.6996 25.9732
## seasonSpring seasonSummer
## 3.1008 -0.4458
## seasonWinter BeautifulWeatherTrue
## -15.4146 5.1726
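The two starting values printed by stepAIC above (367218.2 with k = log(n) and 366616.5 with k = 2) can be reproduced by hand, which makes the difference between the penalties concrete. This sketch assumes the criterion that `extractAIC()` uses for `lm` fits, n * log(RSS / n) + k * edf, and takes the residual sum of squares and degrees of freedom from the output above; the helper `criterion` is a hypothetical name.

```r
rss <- 381868889         # RSS of the full model (the <none> row)
df_resid <- 39920        # residual degrees of freedom
edf <- 70                # 69 estimated slopes + intercept
n_obs <- df_resid + edf  # 39990 complete rows

# Criterion used for lm fits: n * log(RSS / n) + k * edf
criterion <- function(k) n_obs * log(rss / n_obs) + k * edf

aic_start <- criterion(2)           # AIC penalty
bic_start <- criterion(log(n_obs))  # BIC penalty
round(c(AIC = aic_start, BIC = bic_start), 1)
```

Note that stepAIC labels the criterion "AIC" in its output regardless of k, so the values in the first run are BIC values. Since log(39990) is roughly 10.6, each coefficient costs about five times more under BIC than under AIC, which is consistent with BIC removing the multi-level CloudCover and weekday factors while AIC retains them.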
Results of variable selection for each technique:
Adjusted R Squared: hour + Subscription.Type + Mean.TemperatureF + Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles + Mean.Wind.SpeedMPH + new_precipitation + CloudCover + City + weekday + rushhour + weekend_holiday + feellike + season + BeautifulWeather
BIC: hour + Subscription.Type + Mean.TemperatureF + Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH + new_precipitation + City + feellike + season + BeautifulWeather
AIC: hour + Subscription.Type + Mean.TemperatureF + Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH + new_precipitation + CloudCover + City + weekday + weekend_holiday + feellike + season + BeautifulWeather
Based on these results, models were created for each technique.
# Adjusted R-squared selection (this overwrites the earlier full-model object)
m_full_linear <- lm(total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
    Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
    Mean.Wind.SpeedMPH + new_precipitation + CloudCover + City + weekday +
    rushhour + weekend_holiday + feellike + season + BeautifulWeather,
    data = na.omit(linearreg_train))
summary(m_full_linear)
##
## Call:
## lm(formula = total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.Sea.Level.PressureIn + Mean.VisibilityMiles +
## Mean.Wind.SpeedMPH + new_precipitation + CloudCover + City +
## weekday + rushhour + weekend_holiday + feellike + season +
## BeautifulWeather, data = na.omit(linearreg_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.73 -48.37 -14.50 30.24 914.40
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -407.87203 96.35088 -4.233 2.31e-05 ***
## hour1 -28.16655 4.63810 -6.073 1.27e-09 ***
## hour2 -40.15203 4.96819 -8.082 6.56e-16 ***
## hour3 -70.93993 5.53616 -12.814 < 2e-16 ***
## hour4 -66.76924 5.36114 -12.454 < 2e-16 ***
## hour5 -4.72478 4.47359 -1.056 0.290906
## hour6 37.74438 4.05901 9.299 < 2e-16 ***
## hour7 88.06122 3.84926 22.877 < 2e-16 ***
## hour8 116.72090 3.74663 31.154 < 2e-16 ***
## hour9 87.08657 3.75745 23.177 < 2e-16 ***
## hour10 77.74462 3.77619 20.588 < 2e-16 ***
## hour11 87.27396 3.76751 23.165 < 2e-16 ***
## hour12 98.10265 3.77188 26.009 < 2e-16 ***
## hour13 97.50103 3.76604 25.890 < 2e-16 ***
## hour14 94.79978 3.78427 25.051 < 2e-16 ***
## hour15 100.40084 3.76648 26.656 < 2e-16 ***
## hour16 112.75122 3.72985 30.229 < 2e-16 ***
## hour17 138.79079 3.70094 37.501 < 2e-16 ***
## hour18 129.77452 3.71909 34.894 < 2e-16 ***
## hour19 103.71492 3.78834 27.377 < 2e-16 ***
## hour20 79.97934 3.86823 20.676 < 2e-16 ***
## hour21 61.22147 3.94239 15.529 < 2e-16 ***
## hour22 42.92956 4.03620 10.636 < 2e-16 ***
## hour23 22.65927 4.15104 5.459 4.82e-08 ***
## Subscription.TypeRegistered 79.53976 1.08466 73.332 < 2e-16 ***
## Mean.TemperatureF -21.98741 2.86385 -7.678 1.66e-14 ***
## Mean.Humidity -41.80755 7.46944 -5.597 2.19e-08 ***
## Mean.Sea.Level.PressureIn 3.04969 3.04357 1.002 0.316344
## Mean.VisibilityMiles 2.59935 0.52771 4.926 8.44e-07 ***
## Mean.Wind.SpeedMPH -0.47524 0.17582 -2.703 0.006874 **
## new_precipitation -7.38128 2.00444 -3.682 0.000231 ***
## CloudCover1 -5.10061 4.57809 -1.114 0.265228
## CloudCover2 -2.44969 4.50222 -0.544 0.586371
## CloudCover3 2.44002 4.22101 0.578 0.563224
## CloudCover4 0.47551 4.30998 0.110 0.912150
## CloudCover5 2.17559 4.19920 0.518 0.604394
## CloudCover6 3.45778 4.23727 0.816 0.414482
## CloudCover7 1.09245 4.18520 0.261 0.794073
## CloudCover8 -4.75207 4.38947 -1.083 0.278990
## CityArlington 32.96667 1.63465 20.167 < 2e-16 ***
## CityBethesda -9.97142 1.88864 -5.280 1.30e-07 ***
## CityChevy Chase -25.77495 3.17858 -8.109 5.25e-16 ***
## CityDerwood -52.95412 5.67729 -9.327 < 2e-16 ***
## CityRockville -24.72285 2.45899 -10.054 < 2e-16 ***
## CitySilver Spring -26.13397 2.30150 -11.355 < 2e-16 ***
## CityTakoma Park -21.15562 2.40001 -8.815 < 2e-16 ***
## CityWashington 206.54416 1.59247 129.701 < 2e-16 ***
## weekdayTuesday -0.83838 1.91423 -0.438 0.661410
## weekdayWednesday 3.10669 1.89062 1.643 0.100348
## weekdayThursday 1.83079 1.88705 0.970 0.331959
## weekdayFriday 5.14735 1.90212 2.706 0.006811 **
## weekdaySaturday 14.49074 2.73438 5.299 1.17e-07 ***
## weekdaySunday 9.62622 2.73897 3.515 0.000441 ***
## rushhour1 NA NA NA NA
## weekend_holiday1 -7.89313 2.23236 -3.536 0.000407 ***
## feellike 26.40220 3.29396 8.015 1.13e-15 ***
## seasonSpring 3.26512 1.69344 1.928 0.053850 .
## seasonSummer -0.05362 2.08692 -0.026 0.979502
## seasonWinter -15.12865 2.06506 -7.326 2.42e-13 ***
## BeautifulWeatherTrue 5.08320 1.32366 3.840 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.82 on 39931 degrees of freedom
## Multiple R-squared: 0.4769, Adjusted R-squared: 0.4761
## F-statistic: 627.5 on 58 and 39931 DF, p-value: < 2.2e-16
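The NA row for rushhour1 in the summary above is what `lm()` reports when a predictor is an exact linear combination of predictors already in the model. The sketch below reproduces the effect on synthetic data, assuming that the rush-hour flag is a deterministic function of hour. The specific hours used here are a hypothetical definition, since the report's actual rule is not shown in this section.

```r
set.seed(42)

# 20 synthetic days of hourly observations (values are illustrative only).
toy <- data.frame(hour = factor(rep(0:23, times = 20)))

# Hypothetical rush-hour rule: fully determined by the hour of day.
rush_hours <- c("7", "8", "9", "16", "17", "18")
toy$rushhour <- factor(as.integer(toy$hour %in% rush_hours))
toy$total_rides <- rpois(nrow(toy), lambda = 50)

toy_fit <- lm(total_rides ~ hour + rushhour, data = toy)

# The rushhour1 column equals the sum of six hour dummies, so it adds no
# information and lm() leaves its coefficient undefined (NA).
coef(toy_fit)["rushhour1"]
sum(is.na(coef(toy_fit)))
```

A term like this contributes nothing to the fit, so dropping it leaves the fitted values and R-squared unchanged; `alias(toy_fit)` can be used to list which terms are linearly dependent.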
BIC_model <- lm(total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
new_precipitation + City + feellike + season + BeautifulWeather, data = na.omit(linearreg_train))
summary(BIC_model)
##
## Call:
## lm(formula = total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + City + feellike + season + BeautifulWeather,
## data = na.omit(linearreg_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.74 -48.51 -14.50 30.39 912.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -299.5995 15.6944 -19.090 < 2e-16 ***
## hour1 -28.0048 4.6400 -6.035 1.60e-09 ***
## hour2 -39.9391 4.9702 -8.036 9.56e-16 ***
## hour3 -71.0196 5.5378 -12.824 < 2e-16 ***
## hour4 -66.7942 5.3635 -12.453 < 2e-16 ***
## hour5 -5.3131 4.4740 -1.188 0.2350
## hour6 37.4923 4.0597 9.235 < 2e-16 ***
## hour7 87.6110 3.8486 22.764 < 2e-16 ***
## hour8 116.2905 3.7467 31.038 < 2e-16 ***
## hour9 86.8799 3.7587 23.114 < 2e-16 ***
## hour10 77.5900 3.7778 20.538 < 2e-16 ***
## hour11 87.0558 3.7693 23.096 < 2e-16 ***
## hour12 97.9970 3.7739 25.967 < 2e-16 ***
## hour13 97.2500 3.7679 25.810 < 2e-16 ***
## hour14 94.7294 3.7862 25.020 < 2e-16 ***
## hour15 100.1506 3.7682 26.578 < 2e-16 ***
## hour16 112.4641 3.7315 30.139 < 2e-16 ***
## hour17 138.4482 3.7017 37.401 < 2e-16 ***
## hour18 129.4917 3.7196 34.814 < 2e-16 ***
## hour19 103.4653 3.7893 27.305 < 2e-16 ***
## hour20 79.7041 3.8693 20.599 < 2e-16 ***
## hour21 60.9151 3.9436 15.446 < 2e-16 ***
## hour22 42.7005 4.0374 10.576 < 2e-16 ***
## hour23 22.3699 4.1526 5.387 7.21e-08 ***
## Subscription.TypeRegistered 79.3157 1.0820 73.308 < 2e-16 ***
## Mean.TemperatureF -20.4128 2.7391 -7.452 9.36e-14 ***
## Mean.Humidity -51.4137 6.5054 -7.903 2.79e-15 ***
## Mean.VisibilityMiles 2.6898 0.5197 5.176 2.28e-07 ***
## Mean.Wind.SpeedMPH -0.6479 0.1634 -3.965 7.36e-05 ***
## new_precipitation -7.7999 1.9507 -3.998 6.39e-05 ***
## CityArlington 32.8642 1.6355 20.094 < 2e-16 ***
## CityBethesda -10.0070 1.8895 -5.296 1.19e-07 ***
## CityChevy Chase -25.6455 3.1798 -8.065 7.52e-16 ***
## CityDerwood -53.4749 5.6761 -9.421 < 2e-16 ***
## CityRockville -25.0290 2.4579 -10.183 < 2e-16 ***
## CitySilver Spring -26.2259 2.3023 -11.391 < 2e-16 ***
## CityTakoma Park -21.3188 2.4004 -8.881 < 2e-16 ***
## CityWashington 206.3001 1.5929 129.513 < 2e-16 ***
## feellike 24.6193 3.1507 7.814 5.67e-15 ***
## seasonSpring 2.8306 1.6102 1.758 0.0788 .
## seasonSummer -1.0607 2.0147 -0.526 0.5986
## seasonWinter -16.2409 1.9760 -8.219 < 2e-16 ***
## BeautifulWeatherTrue 5.0319 1.2534 4.015 5.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.88 on 39947 degrees of freedom
## Multiple R-squared: 0.476, Adjusted R-squared: 0.4755
## F-statistic: 864 on 42 and 39947 DF, p-value: < 2.2e-16
AIC_model <- lm(total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
    Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
    new_precipitation + CloudCover + City + weekday + weekend_holiday + feellike +
    season + BeautifulWeather, data = na.omit(linearreg_train))
summary(AIC_model)
##
## Call:
## lm(formula = total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
## Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
## new_precipitation + CloudCover + City + weekday + weekend_holiday +
## feellike + season + BeautifulWeather, data = na.omit(linearreg_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.26 -48.35 -14.46 30.26 914.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -312.7756 16.6262 -18.812 < 2e-16 ***
## hour1 -28.1794 4.6381 -6.076 1.25e-09 ***
## hour2 -40.1420 4.9682 -8.080 6.67e-16 ***
## hour3 -70.9311 5.5362 -12.812 < 2e-16 ***
## hour4 -66.7667 5.3611 -12.454 < 2e-16 ***
## hour5 -4.7441 4.4736 -1.060 0.288929
## hour6 37.7392 4.0590 9.298 < 2e-16 ***
## hour7 88.0400 3.8492 22.872 < 2e-16 ***
## hour8 116.7162 3.7466 31.152 < 2e-16 ***
## hour9 87.0766 3.7574 23.174 < 2e-16 ***
## hour10 77.7444 3.7762 20.588 < 2e-16 ***
## hour11 87.2682 3.7675 23.163 < 2e-16 ***
## hour12 98.1012 3.7719 26.009 < 2e-16 ***
## hour13 97.4979 3.7660 25.889 < 2e-16 ***
## hour14 94.7901 3.7843 25.049 < 2e-16 ***
## hour15 100.3806 3.7664 26.651 < 2e-16 ***
## hour16 112.7349 3.7298 30.225 < 2e-16 ***
## hour17 138.7788 3.7009 37.498 < 2e-16 ***
## hour18 129.7587 3.7191 34.890 < 2e-16 ***
## hour19 103.6887 3.7882 27.371 < 2e-16 ***
## hour20 79.9594 3.8682 20.671 < 2e-16 ***
## hour21 61.2126 3.9424 15.527 < 2e-16 ***
## hour22 42.9181 4.0362 10.633 < 2e-16 ***
## hour23 22.6577 4.1510 5.458 4.83e-08 ***
## Subscription.TypeRegistered 79.5258 1.0846 73.325 < 2e-16 ***
## Mean.TemperatureF -21.6258 2.8410 -7.612 2.76e-14 ***
## Mean.Humidity -42.0769 7.4646 -5.637 1.74e-08 ***
## Mean.VisibilityMiles 2.6222 0.5272 4.974 6.60e-07 ***
## Mean.Wind.SpeedMPH -0.5222 0.1695 -3.082 0.002061 **
## new_precipitation -7.5003 2.0009 -3.748 0.000178 ***
## CloudCover1 -5.1220 4.5780 -1.119 0.263229
## CloudCover2 -2.5793 4.5004 -0.573 0.566553
## CloudCover3 2.2890 4.2183 0.543 0.587393
## CloudCover4 0.2518 4.3042 0.058 0.953355
## CloudCover5 1.9416 4.1927 0.463 0.643299
## CloudCover6 3.1763 4.2279 0.751 0.452497
## CloudCover7 0.8579 4.1787 0.205 0.837343
## CloudCover8 -4.7795 4.3894 -1.089 0.276212
## CityArlington 32.9744 1.6346 20.172 < 2e-16 ***
## CityBethesda -9.9547 1.8886 -5.271 1.36e-07 ***
## CityChevy Chase -25.7636 3.1786 -8.105 5.40e-16 ***
## CityDerwood -52.9329 5.6773 -9.324 < 2e-16 ***
## CityRockville -24.7075 2.4589 -10.048 < 2e-16 ***
## CitySilver Spring -26.1301 2.3015 -11.354 < 2e-16 ***
## CityTakoma Park -21.1539 2.4000 -8.814 < 2e-16 ***
## CityWashington 206.5479 1.5925 129.703 < 2e-16 ***
## weekdayTuesday -0.9617 1.9103 -0.503 0.614657
## weekdayWednesday 3.0597 1.8900 1.619 0.105490
## weekdayThursday 1.8053 1.8869 0.957 0.338694
## weekdayFriday 5.1011 1.9016 2.683 0.007309 **
## weekdaySaturday 14.3692 2.7317 5.260 1.45e-07 ***
## weekdaySunday 9.5128 2.7366 3.476 0.000509 ***
## weekend_holiday1 -7.6996 2.2240 -3.462 0.000537 ***
## feellike 25.9732 3.2660 7.953 1.87e-15 ***
## seasonSpring 3.1008 1.6855 1.840 0.065814 .
## seasonSummer -0.4458 2.0499 -0.217 0.827847
## seasonWinter -15.4146 2.0452 -7.537 4.92e-14 ***
## BeautifulWeatherTrue 5.1726 1.3207 3.917 8.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.82 on 39932 degrees of freedom
## Multiple R-squared: 0.4768, Adjusted R-squared: 0.4761
## F-statistic: 638.5 on 57 and 39932 DF, p-value: < 2.2e-16
We then used each model to make predictions on the test dataset and analyzed performance based on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
Adjusted R Squared: MAE of 59.07 and RMSE of 99.52
BIC: MAE of 59.07 and RMSE of 99.50
AIC: MAE of 59.07 and RMSE of 99.53
All three models also have an adjusted R-squared of about 0.476 on the training data (roughly 47% of the variance explained), which is consistent with their nearly identical test performance.
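For reference, the two metrics follow their standard definitions. The notation here is ours and does not appear in the original analysis: $y_i$ is the observed hourly ride count, $\hat{y}_i$ is the model prediction, and $n$ is the number of test observations.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

Because RMSE squares each error before averaging, it weights large misses more heavily than MAE does, so RMSE is never smaller than MAE on the same predictions.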
mfull_pred <- predict(m_full_linear,linearreg_test)
## Warning in predict.lm(m_full_linear, linearreg_test): prediction from a
## rank-deficient fit may be misleading
linearreg_test$total_rides_mfull_pred=mfull_pred
BIC_pred <- predict(BIC_model,linearreg_test)
linearreg_test$total_rides_BIC_pred=BIC_pred
AIC_pred <- predict(AIC_model,linearreg_test)
linearreg_test$total_rides_AIC_pred=AIC_pred
MAE <- function(actual,predicted){
mean(abs(actual - predicted), na.rm = TRUE)
}
MAE(linearreg_test$total_rides,linearreg_test$total_rides_mfull_pred)
## [1] 59.06931
MAE(linearreg_test$total_rides,linearreg_test$total_rides_BIC_pred)
## [1] 59.07309
MAE(linearreg_test$total_rides,linearreg_test$total_rides_AIC_pred)
## [1] 59.07193
rmse(linearreg_test$total_rides,linearreg_test$total_rides_mfull_pred)
## [1] 99.52303
rmse(linearreg_test$total_rides,linearreg_test$total_rides_AIC_pred)
## [1] 99.52603
rmse(linearreg_test$total_rides,linearreg_test$total_rides_BIC_pred)
## [1] 99.50097
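The `rmse()` calls above rely on a function that is not defined in this section; it is presumably provided by a package loaded earlier in the report. As a minimal sketch, an equivalent helper can be written in base R in the same style as `MAE()`. The name `rmse_sketch` and the toy vectors are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): root mean square error
# in base R, skipping missing values in the same way as the MAE() helper.
rmse_sketch <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Toy check with made-up numbers: the errors are 1, -1, 2 and NA,
# so the mean squared error is (1 + 1 + 4) / 3 = 2.
actual    <- c(10, 12, 1, 6)
predicted <- c(9, 13, -1, NA)
rmse_sketch(actual, predicted)
```

Using `na.rm = TRUE` keeps the helper consistent with `MAE()`, so both metrics are computed over the same rows when some predictions are missing.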
\section{Model 2 - Regression Tree Model}
Regression Tree Model (using the same variables as the AIC model)
rpart_model <- rpart(total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
    Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
    new_precipitation + CloudCover + City + weekday + weekend_holiday + feellike +
    season + BeautifulWeather, data = na.omit(linearreg_train))
# Plot the tree
rpart.plot(rpart_model, digits = 3, fallen.leaves = TRUE, type = 4)
fancyRpartPlot(rpart_model)
Predict on the test dataset with the regression tree model
rpart_pred <- predict(rpart_model, linearreg_test)
linearreg_test$total_rides_rpart_pred <- rpart_pred # Save the predictions as variable total_rides_rpart_pred on test
Measure model performance with Mean Absolute Error (MAE) to evaluate the model
MAE(linearreg_test$total_rides,linearreg_test$total_rides_rpart_pred)
## [1] 22.74462
Measure model fit with Root Mean Square Error (RMSE) to evaluate the standard deviation of the model prediction error. A smaller value indicates better model performance.
rmse(linearreg_test$total_rides,linearreg_test$total_rides_rpart_pred)
## [1] 53.34946
\section{Model 3 - Random Forest Model}
Random Forest Model (using a subset of the AIC model's variables; CloudCover, weekday, and weekend_holiday are omitted)
set.seed(123)
rf_model <- randomForest(total_rides ~ hour + Subscription.Type + Mean.TemperatureF +
Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH +
new_precipitation + City + feellike + season + BeautifulWeather, data=linearreg_train,importance=TRUE,na.action=na.omit)
rf_model
##
## Call:
## randomForest(formula = total_rides ~ hour + Subscription.Type + Mean.TemperatureF + Mean.Humidity + Mean.VisibilityMiles + Mean.Wind.SpeedMPH + new_precipitation + City + feellike + season + BeautifulWeather, data = linearreg_train, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 2327.949
## % Var explained: 87.25
plot(rf_model, main="Random Forest") # Plot out-of-bag error (MSE) against the number of trees
importance(rf_model) # Inspect variable importance
## %IncMSE IncNodePurity
## hour 193.39113 162055422
## Subscription.Type 243.99614 77937364
## Mean.TemperatureF 43.93934 20327312
## Mean.Humidity 56.23444 19899546
## Mean.VisibilityMiles 39.10003 6091913
## Mean.Wind.SpeedMPH 44.85421 13628751
## new_precipitation 40.54293 9670355
## City 324.03633 317222045
## feellike 51.47854 26903485
## season 48.20983 12539232
## BeautifulWeather 28.24067 3528786
varImpPlot(rf_model, main="Random Forest by variable importance")
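The importance table can also be ranked directly. The sketch below copies the `%IncMSE` column from the output above into a named vector (`inc_mse`, `ranked`, and `top3` are illustrative names, not objects from the original analysis) and sorts it. `%IncMSE` is a permutation-based measure: it reflects how much the out-of-bag prediction error grows when a variable's values are shuffled.

```r
# %IncMSE values copied from the importance(rf_model) output above
inc_mse <- c(
  hour                 = 193.39113,
  Subscription.Type    = 243.99614,
  Mean.TemperatureF    = 43.93934,
  Mean.Humidity        = 56.23444,
  Mean.VisibilityMiles = 39.10003,
  Mean.Wind.SpeedMPH   = 44.85421,
  new_precipitation    = 40.54293,
  City                 = 324.03633,
  feellike             = 51.47854,
  season               = 48.20983,
  BeautifulWeather     = 28.24067
)

# Rank predictors from most to least important
ranked <- sort(inc_mse, decreasing = TRUE)
top3 <- names(ranked)[1:3]
ranked
```

On these values, City, Subscription.Type, and hour rank well above every weather variable, which suggests that where, who, and when matter more to hourly ridership in this model than the daily weather summaries.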
Predict on the test dataset with the Random Forest model
rf_pred <- predict(rf_model, linearreg_test)
linearreg_test$total_rides_rf_pred <- rf_pred # Save the predictions as variable total_rides_rf_pred on test
Measure model performance with Mean Absolute Error (MAE) to evaluate the model
MAE(linearreg_test$total_rides,linearreg_test$total_rides_rf_pred)
## [1] 17.87722
Measure model fit with Root Mean Square Error (RMSE) to evaluate the standard deviation of the model prediction error. A smaller value indicates better model performance.
rmse(linearreg_test$total_rides,linearreg_test$total_rides_rf_pred)
## [1] 48.67429
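To make the comparison easier to read, the test-set metrics reported in this section can be collected in one table. This is only a convenience sketch: the numbers are copied from the output above, and the data frame and label names (`model_comparison`, `best_by_mae`, `best_by_rmse`) are our own.

```r
# Test-set metrics copied from the output reported in this section
model_comparison <- data.frame(
  model = c("Linear (full)", "Linear (BIC)", "Linear (AIC)",
            "Regression tree", "Random forest"),
  MAE   = c(59.06931, 59.07309, 59.07193, 22.74462, 17.87722),
  RMSE  = c(99.52303, 99.50097, 99.52603, 53.34946, 48.67429)
)

# Sort from most to least accurate by MAE, the primary metric
model_comparison <- model_comparison[order(model_comparison$MAE), ]
print(model_comparison, row.names = FALSE)

# Best model under each metric
best_by_mae  <- model_comparison$model[which.min(model_comparison$MAE)]
best_by_rmse <- model_comparison$model[which.min(model_comparison$RMSE)]
```

Both metrics rank the models in the same order: Random Forest first, the regression tree second, and the three linear models last and nearly tied.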
Aggregate all predictions by date and hour and save them
linearreg_test[is.na(linearreg_test)]<-0
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level,
## NA generated
predictions_df<-as.data.frame(linearreg_test) %>%
group_by(date,hour)%>%
summarise(real=sum(total_rides),
predictions_mfull=sum(total_rides_mfull_pred),
predictions_bic=sum(total_rides_BIC_pred),
predictions_aic=sum(total_rides_AIC_pred),
predictions_rpart=sum(total_rides_rpart_pred),
predictions_rf=sum(total_rides_rf_pred))
head(predictions_df)
## Source: local data frame [6 x 8]
## Groups: date [1]
##
## date hour real predictions_mfull predictions_bic
## <date> <fctr> <int> <dbl> <dbl>
## 1 2015-01-01 0 10 24.16185 31.16064
## 2 2015-01-01 1 12 -98.04242 -90.96437
## 3 2015-01-01 2 1 -142.99458 -135.76282
## 4 2015-01-01 3 1 -140.81580 -133.97916
## 5 2015-01-01 5 6 98.97683 105.16323
## 6 2015-01-01 8 1 0.00000 0.00000
## # ... with 3 more variables: predictions_aic <dbl>,
## # predictions_rpart <dbl>, predictions_rf <dbl>
write.csv(predictions_df,file="predictions_df.csv",row.names=FALSE)
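The warning produced by the `is.na()` replacement above arises because `0` is not a valid level of a factor column, so those entries stay `NA`. One possible alternative, sketched here on a toy data frame, restricts the replacement to numeric columns. The object `toy_test` and its values are made up for illustration, and whether zero-filling missing predictions is appropriate at all remains a separate modeling decision.

```r
# Toy stand-in for linearreg_test (values are made up for illustration)
toy_test <- data.frame(
  hour        = factor(c("0", "1", NA)),
  total_rides = c(10, NA, 1),
  pred        = c(24.2, -98.0, NA)
)

# Replace NA with 0 only in numeric columns; factor columns are left untouched
numeric_cols <- vapply(toy_test, is.numeric, logical(1))
toy_test[numeric_cols] <- lapply(toy_test[numeric_cols], function(col) {
  col[is.na(col)] <- 0
  col
})

toy_test
```

Selecting columns with `vapply(..., is.numeric, logical(1))` avoids the invalid-factor-level warning because factor columns are never assigned to.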
\section{4. Discussion}
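Comparing MAE and RMSE across the models shows that the Random Forest model provided the most accurate predictions of hourly ridership: its test MAE was 17.88 and its RMSE 48.67, compared with 22.74 and 53.35 for the regression tree and roughly 59.07 and 99.5 for each of the linear models.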
\section{References}
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http://www.R-project.org/.
\section{Appendix}
\section{I. Authors' Individual Contribution}
Elvin did the data exploration, Hellen designed the Random Forest model, Lee designed the linear models, and Tarek designed the regression tree model. All team members contributed to the data collection, preprocessing, and analysis.